Enabling Multi-level Trust in Privacy
Preserving Data Mining 

Yaping Li, Minghua Chen, Qiwei Li and Wei Zhang

Abstract — Privacy Preserving Data Mining (PPDM) addresses the problem of developing accurate models about aggregated 
data without access to precise information in individual data records. A widely studied perturbation-based PPDM approach
introduces random perturbation to individual values to preserve privacy before data is published. Previous solutions of this 
approach are limited in their tacit assumption of single-level trust on data miners. 

In this work, we relax this assumption and expand the scope of perturbation-based PPDM to Multi-Level Trust (MLT-PPDM). In 
our setting, the more trusted a data miner is, the less perturbed copy of the data it can access. Under this setting, a malicious data 
miner may have access to differently perturbed copies of the same data through various means, and may combine these diverse 
copies to jointly infer additional information about the original data that the data owner does not intend to release. Preventing 
such diversity attacks is the key challenge of providing MLT-PPDM services. We address this challenge by properly correlating
perturbation across copies at different trust levels. We prove that our solution is robust against diversity attacks with respect to 
our privacy goal. That is, for data miners who have access to an arbitrary collection of the perturbed copies, our solution prevents
them from jointly reconstructing the original data more accurately than the best effort using any individual copy in the collection.
Our solution allows a data owner to generate perturbed copies of its data for arbitrary trust levels on-demand. This feature offers
data owners maximum flexibility.

Index Terms — Privacy preserving data mining, multi-level trust, random perturbation 
1 Introduction

Data perturbation, a widely employed and accepted
Privacy Preserving Data Mining (PPDM) approach,
tacitly assumes single-level trust on data miners. This
approach introduces uncertainty about individual values
before data is published or released to third
parties for data mining purposes [5].
Under the single trust level assumption, a data
owner generates only one perturbed copy of its data
with a fixed amount of uncertainty. This assumption
is limiting in various applications where a data owner
trusts the data miners at different levels.

We present below a two trust level scenario as a 
motivating example. 

• The government or a business might do internal 
(most trusted) data mining, but they may also 
want to release the data to the public, and might 
perturb it more. The mining department which 
receives the less perturbed internal copy also 
has access to the more perturbed public copy. It 
would be desirable that this department does not 
have more power in reconstructing the original 
data by utilizing both copies than when it has 
only the internal copy. 
she is now with Department of Information Engineering at the Chinese
University of Hong Kong. E-mail: yaping@eecs.berkeley.edu.
M. Chen, Qiwei Li and Wei Zhang are with Department of Information
Engineering at the Chinese University of Hong Kong, Shatin, Hong Kong,
China. E-mail: {minghua, IqivOOS, zw007}@ie.cuhk.edu.hk.



• Conversely, if the internal copy is leaked to the 
public, then obviously the public has all the 
power of the mining department. However, it 
would be desirable if the public cannot recon- 
struct the original data more accurately when it 
uses both copies than when it uses only the 
leaked internal copy. 
This new dimension of Multi-Level Trust (MLT) 
poses new challenges for perturbation based PPDM. 
In contrast to the single-level trust scenario where 
only one perturbed copy is released, now multiple 
differently perturbed copies of the same data are available
to data miners at different trust levels. The
more trusted a data miner is, the less perturbed copy 
it can access; it may also have access to the perturbed 
copies available at lower trust levels. Moreover, a 
data miner could access multiple perturbed copies 
through various other means, e.g., accidental leakage 
or colluding with others. 

By utilizing diversity across differently perturbed 
copies, the data miner may be able to produce a more 
accurate reconstruction of the original data than what 
is allowed by the data owner. We refer to this attack 
as a diversity attack. It includes the colluding attack 
scenario where adversaries combine their copies to 
mount an attack; it also includes the scenario where 
an adversary utilizes public information to perform 
the attack on its own. Preventing diversity attacks is 
the key challenge in solving the MLT-PPDM problem. 

In this paper, we address this challenge in enabling
MLT-PPDM services. In particular, we focus on the additive
perturbation approach where random Gaussian
noise is added to the original data with arbitrary distribution,
and provide a systematic solution. Through
a one-to-one mapping, our solution allows a data
owner to generate distinctly perturbed copies of its
data according to different trust levels. Defining trust
levels and determining such mappings are beyond the
scope of this paper.

1.1 Contributions 

We make the following contributions: 

• We expand the scope of perturbation based 
PPDM to multi-level trust, by relaxing the im- 
plicit assumption of single-level trust in existing 
work. MLT-PPDM introduces another dimension
of flexibility which allows data owners to generate
differently perturbed copies of their data for
different trust levels.

• We identify a key challenge in enabling MLT- 
PPDM services. In MLT-PPDM, data miners may 
have access to multiple perturbed copies. By 
combining multiple perturbed copies, data min- 
ers may be able to perform diversity attacks to 
reconstruct the original data more accurately than 
what is allowed by the data owner. Defending against
such attacks is challenging, which we explain
through a case study in Section 4.

• We address this challenge by properly correlating 
perturbation across copies at different trust levels. 
We prove that our solution is robust against di- 
versity attacks. We propose several algorithms for 
different targeting scenarios. We demonstrate the 
effectiveness of our solution through experiments 
on real data. 

• Our solution allows data owners to generate perturbed
copies of their data at arbitrary trust levels
on-demand. This property offers data owners
maximum flexibility.

1.2 Related Work

Privacy Preserving Data Mining (PPDM) was first
proposed in [2] and [8] simultaneously. To address
this problem, researchers have since proposed various
solutions that fall into two broad categories based on
the level of privacy protection they provide. The first
category, the Secure Multiparty Computation (SMC)
approach, provides the strongest level of privacy; it
enables mutually distrustful entities to mine their
collective data without revealing anything except for
what can be inferred from an entity's own input and
the output of the mining operation alone.
In principle, any data mining algorithm can be implemented
by using generic algorithms of SMC [10].
However, these algorithms are extraordinarily expensive
in practice, and impractical for real use. To avoid
the high computational cost, various solutions that
are more efficient than generic SMC algorithms have
been proposed for specific mining tasks. Solutions to
build decision trees over horizontally partitioned
data were proposed in [8]. For vertically partitioned
data, algorithms have been proposed to address the
association rule mining [9], k-means clustering [11],
and frequent pattern mining problems [12]. The work
of [13] uses a secure coprocessor for privacy preserving
collaborative data mining and analysis.

The second category, the partial information hiding
approach, trades privacy for improved performance
in the sense that malicious data miners may
infer certain properties of the original data from the
disguised data. Various solutions in this category
allow a data owner to transform its data in different
ways to hide the true values of the original data
while still permitting useful mining
operations over the modified data. This approach
can be further divided into three categories: (a) k-anonymity,
(b) retention replacement (which retains an element with
probability p or replaces it with an element selected from
a probability distribution function on the domain of
the elements), and (c) data perturbation
(which introduces uncertainty about individual
values before data is published).

The data perturbation approach includes two main
classes of methods: additive and
matrix multiplicative schemes. These methods
apply mainly to continuous data. In this paper, we
focus solely on the additive perturbation approach
where noise is added to data values.

Another relevant line of research concerns the
problem of privately computing various set related
operations. Two party protocols for intersection, intersection
size, equijoin, and equijoin size were introduced
in [24] for the honest-but-curious adversarial
model. Some of the proposed protocols leak information [25].
Similar protocols for set intersection have
been proposed in [26], [27]. Efficient two party protocols
for the private matching problem which are
secure in both the malicious and honest-but-curious
models were introduced in [28]. Efficient private and
threshold set intersection protocols were proposed
in [29]. While most of these protocols are equality
based, algorithms in [25] compute arbitrary join predicates
leveraging the power of a secure coprocessor.
Tiny trusted devices were used for secure function
evaluation in [30].

Our work does not address re-anonymizing a dataset after
it is updated with insertions and/or deletions, which
is a topic studied by the authors in [31], [32], [33],
[34]. Instead, we study anonymizing the same dataset
at multiple trust levels. The two problems are orthogonal.

An earlier version of this paper appeared in [35] and
initiated the topic of MLT-PPDM. Recently, Xiao et al.
proposed an algorithm of multi-level uniform perturbation [36].
Our paper differs from [36] in three main
aspects. Firstly, the two papers address different problems
and tackle them under different privacy
measures. We propose multi-level privacy preserving
for additive Gaussian noise perturbation, and use a
measure based on how closely the original values
can be reconstructed from the perturbed data [2], [5],
whereas [36] presents an algorithm of multi-level
uniform perturbation, and studies its performance
using the $\rho_1$-$\rho_2$ privacy measure [37]. As a result,
neither can the solution in [36] be easily applied to
the problem in this paper, nor can the solution in this
paper be directly applied to the problem in [36].
Secondly, based on Gaussian noise perturbation, the
solution in this paper is more suitable for high-dimensional
data, as compared to that in [36] based on
uniform perturbation [38]. Thirdly, we present several
nontrivial theoretical results. We discuss reconstruction
errors under independent noise, analyze the security
of our scheme when collusion occurs, and study
the computational complexities based on the Kronecker
product. These results provide fundamental insights
into the problem.



1.3 Paper Layout

The rest of the paper is organized as follows. We
go over preliminaries in Section 2. We formulate the
problem, and define our privacy goal in Section 3.
In Section 4, we present a simple but important case
study. It highlights the key challenge in achieving our
privacy goal, and presents the intuition that leads
to our solution. In Section 5, we formally present
our solution, and prove that it achieves our privacy
goal. Algorithms that target different scenarios are
also proposed, and their complexities are studied.
We carry out extensive experiments on real data in
Section 6 to verify our theoretical analysis. Section 7
concludes the paper.



2 Preliminaries 
2.1 Jointly Gaussian 

In this paper, we focus on perturbing data by additive
Gaussian noise [5], i.e., the added
noises are jointly Gaussian.¹

Let $G_1$ through $G_L$ be $L$ Gaussian random variables.
They are said to be jointly Gaussian if and only
if each of them is a linear combination of multiple independent
Gaussian random variables.² Equivalently,
$G_1$ through $G_L$ are jointly Gaussian if and only if any
linear combination of them is also a Gaussian random
variable.

1. Note that we do not make any assumptions about the distribution
of the data.

A vector formed by jointly Gaussian random variables
is called a jointly Gaussian vector. For a jointly
Gaussian vector $G = [G_1, \ldots, G_L]^T$, its probability
density function (PDF) is as follows: for any real
vector $g$,

$$f_G(g) = \frac{1}{\sqrt{(2\pi)^L \det(K_G)}} \exp\left(-\frac{1}{2}(g - \mu_G)^T K_G^{-1} (g - \mu_G)\right),$$

where $\mu_G$ and $K_G$ are the mean vector and covariance
matrix of $G$, respectively.

Note that not all Gaussian random variables are
jointly Gaussian. For example, let $G_1$ be a zero mean
Gaussian random variable with a positive variance,
and define $G_2$ as

$$G_2 = \begin{cases} G_1, & \text{if } |G_1| \le 1; \\ -G_1, & \text{otherwise,} \end{cases}$$

where $|G_1|$ is the absolute value of $G_1$. It is straightforward
to verify that $G_2$ is Gaussian, but $G_1 + G_2$ is
not. Therefore, $G_1$ and $G_2$ are not jointly Gaussian.
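The counterexample can be checked numerically. The following sketch is an illustration, not part of the original analysis. It assumes unit variance for $G_1$, builds $G_2$ as defined above, and measures how often $G_1 + G_2$ is exactly zero. A Gaussian variable with positive variance takes any single value with probability zero, so a sizeable point mass at zero indicates that the sum is not Gaussian.

```python
import random


def sample_pair(rng):
    """Draw (G1, G2), with G2 = G1 if |G1| <= 1 and -G1 otherwise."""
    g1 = rng.gauss(0.0, 1.0)  # assumed: unit variance
    g2 = g1 if abs(g1) <= 1.0 else -g1
    return g1, g2


rng = random.Random(0)  # fixed seed, for reproducibility only
n = 100_000
pairs = [sample_pair(rng) for _ in range(n)]

# G1 + G2 equals 2*G1 when |G1| <= 1 and exactly 0 otherwise.
zero_fraction = sum(1 for g1, g2 in pairs if g1 + g2 == 0.0) / n

# G2 alone is symmetric around zero and keeps the variance of G1.
g2_var = sum(g2 * g2 for _, g2 in pairs) / n

print(f"P(G1 + G2 = 0) is about {zero_fraction:.3f}")
print(f"Var(G2) is about {g2_var:.3f}")
```

With unit variance, the zero fraction should approach $P(|G_1| > 1)$, roughly 0.317.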

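If multiple random variables are jointly Gaussian,
then conditional on a subset of them, the remaining
variables are still jointly Gaussian. Specifically, partition
a jointly Gaussian vector $G$, its mean vector $\mu_G$, and its
covariance matrix $K_G$ as

$$G = \begin{bmatrix} G_1 \\ G_2 \end{bmatrix}, \quad \mu_G = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \text{and} \quad K_G = \begin{bmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{bmatrix}$$

accordingly. Then the distribution of $G_2$ given $G_1 = v_1$
is also jointly Gaussian with mean $\mu_2 + K_{21}K_{11}^{-1}(v_1 - \mu_1)$
and covariance matrix $K_{22} - K_{21}K_{11}^{-1}K_{12}$ [39, ch 2.5].
This is a key property of jointly Gaussian
variables. We utilize this property in Section 5.3.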
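As an illustration of the conditioning property, the sketch below computes the conditional mean and covariance with NumPy. The function name `conditional_gaussian`, the argument `n1` (the number of entries in $G_1$), and the numeric values are assumptions made for this example only.

```python
import numpy as np


def conditional_gaussian(mu, K, n1, v1):
    """Mean and covariance of G2 given G1 = v1.

    mu and K describe the full vector G = [G1, G2]; the first n1
    entries belong to G1 and the rest to G2.
    """
    mu1, mu2 = mu[:n1], mu[n1:]
    K11, K12 = K[:n1, :n1], K[:n1, n1:]
    K21, K22 = K[n1:, :n1], K[n1:, n1:]
    # Solve K11 * w = (v1 - mu1) instead of forming the inverse explicitly.
    cond_mean = mu2 + K21 @ np.linalg.solve(K11, v1 - mu1)
    cond_cov = K22 - K21 @ np.linalg.solve(K11, K12)
    return cond_mean, cond_cov


# Hypothetical two-variable example.
mu = np.array([0.0, 0.0])
K = np.array([[1.0, 0.5],
              [0.5, 2.0]])
cond_mean, cond_cov = conditional_gaussian(mu, K, n1=1, v1=np.array([2.0]))
print(cond_mean, cond_cov)
```

Here the conditional mean is $0 + 0.5 \cdot 2 = 1$ and the conditional variance is $2 - 0.5 \cdot 0.5 = 1.75$.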

2.2 Additive Perturbation 

The single-level trust PPDM problem via data pertur- 
bation has been widely studied in literature. In this 
setting, a data owner implicitly trusts all recipients of 
its data uniformly and distributes a single perturbed 
copy of the data. 

A widely used and accepted way to perturb data
is by additive perturbation. This
approach adds to the original data, $X$, some random
noise, $Z$, to obtain the perturbed copy, $Y$, as follows:

$$Y = X + Z. \tag{1}$$

We assume that $X$, $Y$, and $Z$ are all $N$-dimensional
vectors, where $N$ is the number of attributes in $X$.
Let $x_j$, $y_j$, and $z_j$ be the $j$-th entries of $X$, $Y$, and $Z$,
respectively.

2. Two random variables are independent if knowing the value
of one yields no knowledge about that of the other. Mathematically,
two random variables $G_1$ and $G_2$ are independent if, for
any values $g_1$ and $g_2$, $f_{G_1,G_2}(g_1, g_2) = f_{G_1}(g_1) f_{G_2}(g_2)$, where
$f_{G_1,G_2}(g_1, g_2)$ is the joint probability density function of $G_1$ and
$G_2$, and $f_{G_1}(g_1)$ and $f_{G_2}(g_2)$ are the probability density functions
of $G_1$ and $G_2$, respectively. Generally, random variables $G_1$ through
$G_L$ are mutually independent if, for any values $g_1$ through $g_L$,
$f_{G_1,\ldots,G_L}(g_1, \ldots, g_L) = f_{G_1}(g_1) \cdots f_{G_L}(g_L)$.

The original data $X$ follows a distribution with
mean vector $\mu_X$ and covariance matrix $K_X$. The covariance
$K_X$ is an $N \times N$ positive semi-definite matrix
given by

$$K_X = E\left[(X - \mu_X)(X - \mu_X)^T\right], \tag{2}$$

which is a diagonal matrix if the attributes in $X$ are
uncorrelated.

The noise $Z$ is assumed to be independent of $X$
and is a jointly Gaussian vector with zero mean and
covariance matrix $K_Z$ chosen by the data owner. In
short, we write it as $Z \sim N(0, K_Z)$. The covariance
matrix $K_Z$ is an $N \times N$ positive semi-definite matrix
given by

$$K_Z = E\left[ZZ^T\right]. \tag{3}$$

It is straightforward to verify that the mean vector of $Y$
is also $\mu_X$, and its covariance matrix, denoted by $K_Y$,
is

$$K_Y = K_X + K_Z.$$

The perturbed copy $Y$ is published or released to
data miners. Equation (1) models both the cases where
a data miner sees a perturbed copy of $X$, and where it
knows the true values of certain attributes. The latter
scenario is considered in recent work [7], where the
authors show that sophisticated filtering techniques
utilizing the true value leaks can help recover $X$.

In general, given $Y$, a malicious data miner's goal
is to reconstruct $X$ by filtering out the added noise.
The authors of [4] point out that the attributes in $X$
and the added noise should have the same correlation,
otherwise the noise can be easily filtered out. This
observation essentially requires choosing $K_Z$ to be
proportional to $K_X$, i.e., $K_Z = \sigma_Z^2 K_X$ for some
constant $\sigma_Z^2$ denoting the perturbation magnitude.
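A minimal sketch of additive perturbation with noise covariance proportional to $K_X$ is given below. The helper `perturb`, the synthetic Gaussian data, and all numeric values are assumptions for illustration; the text itself places no assumption on the distribution of $X$.

```python
import numpy as np


def perturb(X, K_X, sigma2, rng):
    """Return Y = X + Z with Z ~ N(0, sigma2 * K_X), one record per row of X."""
    n_records, n_attrs = X.shape
    Z = rng.multivariate_normal(np.zeros(n_attrs), sigma2 * K_X, size=n_records)
    return X + Z


rng = np.random.default_rng(0)
K_X = np.array([[1.0, 0.5],
                [0.5, 2.0]])   # assumed attribute covariance
X = rng.multivariate_normal([10.0, 20.0], K_X, size=200_000)  # synthetic data
sigma2 = 0.5                   # assumed perturbation magnitude
Y = perturb(X, K_X, sigma2, rng)

K_Z_hat = np.cov((Y - X).T)    # empirical noise covariance
K_Y_hat = np.cov(Y.T)          # empirical covariance of the perturbed copy
print(K_Z_hat)
print(K_Y_hat)
```

Up to sampling error, the empirical noise covariance should be close to $\sigma_Z^2 K_X$ and the covariance of the perturbed copy close to $(1 + \sigma_Z^2) K_X$.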

2.3 Linear Least Squares Error Estimation 

Given a perturbed copy of the data, a malicious data
miner may attempt to reconstruct the original data
as accurately as possible. Among the family of linear
reconstruction methods, where estimates can only be
linear functions of the perturbed copy, Linear Least
Squares Error (LLSE) estimation has the minimum
square errors between the estimated values and the
original values [39, ch 7.1-7.2].

The LLSE estimate of $X$ given $Y$, denoted by $\hat{X}(Y)$,
is (see Appendix A for the derivation)

$$\hat{X}(Y) = K_{XY} K_Y^{-1} (Y - \mu_X) + \mu_X, \tag{4}$$



where $K_{XY}$ is the covariance matrix of $X$
and $Y$, and $K_Y$ is the covariance matrix of $Y$. $K_{XY}$ is given by

$$\begin{aligned} K_{XY} &= E\left[(X - \mu_X)(Y - E[Y])^T\right] \\ &= E\left[(X - \mu_X)\big((X - \mu_X) + (Z - 0)\big)^T\right] \\ &= K_X + 0 = K_X. \end{aligned}$$

Note that in the above derivation, we compute
$E[(X - \mu_X)Z^T] = E[(X - \mu_X)]E[Z^T] = 0$, since $X$ and $Z$ are
independent.

The square estimation errors between the LLSE
estimates and the original values of the attributes in
$X$ are the diagonal terms of the covariance matrix of
$X - \hat{X}(Y)$. An important property of LLSE estimation
is that it simultaneously minimizes all these estimation
errors.
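The estimate in Equation (4) can be sketched in a few lines of NumPy. The function names and numeric values below are hypothetical. The error covariance uses the standard LLSE identity $K_X - K_{XY} K_Y^{-1} K_{XY}^T$, specialized here to $K_{XY} = K_X$.

```python
import numpy as np


def llse_estimate(y, mu_X, K_X, K_Z):
    """LLSE estimate of X from one perturbed record y = x + z (Equation (4))."""
    K_Y = K_X + K_Z  # covariance of Y
    # K_XY = K_X because X and Z are independent.
    return K_X @ np.linalg.solve(K_Y, y - mu_X) + mu_X


def llse_error_covariance(K_X, K_Z):
    """Covariance of X - X_hat(Y), i.e. K_X - K_X K_Y^{-1} K_X."""
    K_Y = K_X + K_Z
    return K_X - K_X @ np.linalg.solve(K_Y, K_X)


mu_X = np.array([10.0, 20.0])   # assumed statistics of X
K_X = np.array([[1.0, 0.5],
                [0.5, 2.0]])
sigma2 = 1.0                    # assumed perturbation magnitude
K_Z = sigma2 * K_X              # noise proportional to K_X

y = np.array([12.0, 19.0])      # hypothetical perturbed record
x_hat = llse_estimate(y, mu_X, K_X, K_Z)
err_cov = llse_error_covariance(K_X, K_Z)
print(x_hat)    # halfway between y and mu_X when sigma2 = 1
print(err_cov)  # diagonal entries are the per-attribute square errors
```

With $\sigma_Z^2 = 1$, the estimate shrinks each observed value halfway toward the mean, and the error covariance is $0.5\,K_X$.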

2.4 Kronecker Product

In the MLT-PPDM problem, the covariance matrix of
noises can be written as the Kronecker product [40] of
two matrices. In this paper, we explore the properties
of the Kronecker product for efficient computation.

The Kronecker product [40] is a binary matrix operator
that maps two matrices of arbitrary dimensions
into a larger matrix with a special block structure.
Given an $n \times m$ matrix $A$ and a $p \times q$ matrix $B$, where

$$A = \begin{bmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{bmatrix},$$

their Kronecker product, denoted as $A \otimes B$, is an $np \times mq$
matrix with the block structure

$$A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1m}B \\ \vdots & \ddots & \vdots \\ a_{n1}B & \cdots & a_{nm}B \end{bmatrix}.$$

We list several properties of the Kronecker product that
will be used later. Assuming that $A$, $B$, $C$ and $D$ are
matrices whose dimensions are appropriate for the
computation in each property, we have

1) $(aA) \otimes B = A \otimes (aB) = a(A \otimes B)$, where $a \in \mathbb{R}$;

2) $(A \otimes B)^T = A^T \otimes B^T$;

3) $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$;

4) $(A \otimes B)(C \otimes D) = AC \otimes BD$;

5) $\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}(B)$, where $\mathrm{vec}(\cdot)$ denotes
the vectorization of a matrix formed by
stacking the columns of the matrix into a single
column vector.
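The five properties can be spot-checked numerically with `numpy.kron`. The matrices below are arbitrary invertible examples chosen for this sketch, and `vec` is a small helper that stacks columns (column-major order), matching the definition in property 5.

```python
import numpy as np


def vec(M):
    """Stack the columns of M into a single vector (column-major order)."""
    return M.flatten(order="F")


A = np.array([[2.0, 1.0], [1.0, 3.0]])
B = np.array([[1.0, 2.0], [3.0, 5.0]])
C = np.array([[0.0, 1.0], [4.0, 2.0]])
D = np.array([[1.0, 1.0], [2.0, 0.0]])
a = 2.5

checks = {
    "scalar": np.allclose(np.kron(a * A, B), a * np.kron(A, B))
    and np.allclose(np.kron(A, a * B), a * np.kron(A, B)),
    "transpose": np.allclose(np.kron(A, B).T, np.kron(A.T, B.T)),
    "inverse": np.allclose(
        np.linalg.inv(np.kron(A, B)),
        np.kron(np.linalg.inv(A), np.linalg.inv(B)),
    ),
    "mixed product": np.allclose(
        np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)
    ),
    "vec": np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B)),
}
print(checks)
```

Property 3 is the one that matters for cost: inverting the two small factors separately avoids inverting the larger $np \times np$ product directly.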

3 Problem Formulation

In this section, we present the problem settings, de- 
scribe our threat model, state our privacy goal, and 
identify the design space. Table 1 lists the key notations
used in the paper.
TABLE 1
Key Notations

Notation        Definition
$X$             original data
$Y_i$           perturbed copy of $X$ of trust level $i$
$Z_i$           noise added to $X$ to generate $Y_i$
$M$             number of trust levels
$N$             number of attributes in $X$
$Y$             a vector of all $M$ perturbed copies
$Z$             a vector of noise $Z_1$ to $Z_M$
$\hat{X}(Y)$    LLSE estimate of $X$ given $Y$
$K_X$           covariance matrix of $X$
$K_Z$           covariance matrix of $Z$



3.1 Problem Settings 

In the MLT-PPDM problem we consider in this paper, 
a data owner trusts data miners at different levels and 
generates a series of perturbed copies of its data for 
different trust levels. This is done by adding varying
amounts of noise to the data.

Under the multi-level trust setting, data miners at
higher trust levels can access less perturbed copies.
Such less perturbed copies are not accessible by data
miners at lower trust levels. In some scenarios, such
as the motivating example we give at the beginning
of Section 1, data miners at higher trust levels may
also have access to the perturbed copies at more than
one trust level. Data miners at different trust levels
may also collude to share the perturbed copies among
them. As such, it is common that data miners can have
access to more than one perturbed copy.

Specifically, we assume that the data owner wants
to release $M$ perturbed copies of its data $X$, which is
an $N \times 1$ vector with mean $\mu_X$ and covariance $K_X$ as
defined in Section 2.2. These $M$ copies can be generated
in various fashions. They can be jointly generated
all at once. Alternatively, they can be generated at
different times upon receiving new requests from data
miners, in an on-demand fashion. The latter case gives
data owners maximum flexibility.

It is true that the data owner may consider releasing
only the mean and covariance of the original data. We
remark that simply releasing the mean and covariance
does not provide the same utility as the perturbed
data. For many real applications, knowing only the
mean and covariance may not be sufficient to apply
data mining techniques, such as clustering, principal
component analysis, and classification [6]. By using
random perturbation to release the dataset, the data
owner allows the data miner to exploit more statistical
information without releasing the exact values of
sensitive attributes.

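Let $Y = [Y_1^T, \ldots, Y_M^T]^T$ be the vector of all perturbed
copies $Y_i$ ($1 \le i \le M$). Let $Z = [Z_1^T, \ldots, Z_M^T]^T$ be the
vector of noise. Let $H$ be an $(N \cdot M) \times N$ matrix as
follows:

$$H = \begin{bmatrix} I_N \\ \vdots \\ I_N \end{bmatrix},$$

where $I_N$ represents an $N \times N$ identity matrix.

We have the relationship between $Y$, $X$ and $Z$ as
follows:

$$Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_M \end{bmatrix} = \begin{bmatrix} I_N \\ \vdots \\ I_N \end{bmatrix} X + \begin{bmatrix} Z_1 \\ \vdots \\ Z_M \end{bmatrix} = HX + Z,$$

where $Z_i$, $1 \le i \le M$, are independent of $X$. To be
robust against advanced filtering attacks, individual
noise terms in $Z_i$ added to different attributes in $X$
should have the same correlations as the attributes
themselves, otherwise $Z_i$ can be easily filtered out [4].
As such, we have

$$K_{Z_i} = \sigma_{Z_i}^2 K_X, \quad \text{and} \quad K_{Y_i} = (1 + \sigma_{Z_i}^2) K_X,$$

where $\sigma_{Z_i}^2$ is a constant denoting the perturbation magnitude.
The data owner chooses a value for $\sigma_{Z_i}^2$ according to
the trust level associated with the target perturbed
copy $Y_i$.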

3.2 Threat Model 

We assume malicious data miners who always at- 
tempt to reconstruct a more accurate estimate of the 
original data given perturbed copies. We hence use 
the terms data miners and adversaries interchange- 
ably throughout this paper. In MLT-PPDM, adver- 
saries may have access to a subset of the perturbed 
copies of the data. The adversaries' goal is to recon- 
struct the original data as accurately as possible based 
on all available perturbed copies. 

The reconstruction accuracy depends heavily on
the adversaries' knowledge. We make the same assumption
as in prior work that adversaries have the
knowledge of the statistics of the original data $X$ and
the noise $Z$, i.e., mean $\mu_X$, and covariance matrices
$K_X$ and $K_Z$. Note that adversaries with less knowledge
are weaker than the ones we study in this paper.

In addition, we assume adversaries only perform 
linear estimation attacks, where estimates can only be 
linear functions of the perturbed data Y. It is known 
that if X follows a jointly Gaussian distribution, then 
LLSE estimation achieves the minimum estimation 
error among both linear and nonlinear estimation 
methods. For X with general distribution, LLSE es- 
timation has the minimum estimation error among 
all linear estimation methods. Various recent work in 
perturbation based PPDM, such as [4] and [5], makes 
this assumption of linear estimation. See reference Q 
for a comprehensive review. 
Noting that $K_{XY} = K_X H^T$ and $K_Y = H K_X H^T + K_Z$,
the LLSE estimate $\hat{X}(Y)$ of $X$ given $Y$ can be expressed
as:

$$\begin{aligned} \hat{X}(Y) &= K_{XY} K_Y^{-1} (Y - E[Y]) + \mu_X \\ &= K_X H^T \left[H K_X H^T + K_Z\right]^{-1} (Y - H\mu_X) + \mu_X. \end{aligned} \tag{6}$$

In our setting, $\hat{X}(Y)$ is the most accurate estimate of
$X$ that an adversary can possibly make. The corresponding
estimation errors of attributes in $X$ are the
diagonal terms of the covariance matrix of $\hat{X}(Y) - X$.
Using Equation (6), we can compute the covariance
matrix as follows:

$$E\left[\left(\hat{X}(Y) - X\right)\left(\hat{X}(Y) - X\right)^T\right] = K_X - K_X H^T K_Y^{-1} H K_X. \tag{7}$$

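For an adversary who observes only a single copy
$Y_i$ ($1 \le i \le M$) and gets an LLSE estimate $\hat{X}(Y_i)$, the
covariance matrix of $\hat{X}(Y_i) - X$ has a simple form as
follows:

$$E\left[\left(\hat{X}(Y_i) - X\right)\left(\hat{X}(Y_i) - X\right)^T\right] = K_X - K_X K_{Y_i}^{-1} K_X = \frac{\sigma_{Z_i}^2}{1 + \sigma_{Z_i}^2} K_X. \tag{8}$$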
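The relation between the joint error covariance and its single-copy special case can be checked numerically. The sketch below builds $H$ with `numpy.kron`; the function name `joint_error_covariance` and the numeric values are assumptions for illustration.

```python
import numpy as np


def joint_error_covariance(K_X, K_Z):
    """K_X - K_X H^T K_Y^{-1} H K_X for stacked copies Y = H X + Z."""
    N = K_X.shape[0]
    M = K_Z.shape[0] // N
    H = np.kron(np.ones((M, 1)), np.eye(N))  # M stacked identity blocks
    K_Y = H @ K_X @ H.T + K_Z
    return K_X - K_X @ H.T @ np.linalg.solve(K_Y, H @ K_X)


K_X = np.array([[1.0, 0.5],
                [0.5, 2.0]])  # assumed data covariance
sigma2 = 4.0                  # assumed perturbation magnitude

# Single copy (M = 1): the general form should collapse to Equation (8).
single = joint_error_covariance(K_X, sigma2 * K_X)
expected = sigma2 / (1.0 + sigma2) * K_X
print(np.allclose(single, expected))
```

The same function accepts any $(N \cdot M) \times (N \cdot M)$ noise covariance, so it can be reused to evaluate differently correlated noise designs.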



3.3 Definitions 

3.3.1 Distortion 

To facilitate future discussion on privacy, we define
the concept of distortion $\mathcal{D}$ between two datasets as
the average expected square difference between them.
For example, the distortion between the original data
$X$ and the perturbed copy $Y$ as defined in Section 2.2
is given by:

$$\mathcal{D}(X, Y) = \frac{1}{N}\mathrm{Tr}\left(E\left[(Y - X)(Y - X)^T\right]\right).$$

It is easy to see that $\mathcal{D}(X, Y) = \mathcal{D}(Y, X)$.

Based on the above definition, we refer to a perturbed
copy $Y_2$ as more perturbed than $Y_1$ with
respect to $X$ if and only if $\mathcal{D}(X, Y_2) > \mathcal{D}(X, Y_1)$.

3.3.2 Privacy under Single-level Trust Setting 

With respect to the original data $X$, the privacy of a
perturbed copy $Y$ represents how well the true values
of $X$ are hidden in $Y$.

A more perturbed copy of the data does not necessarily
have more privacy since the added noise
may be intelligently filtered out. Consequently, we
define the privacy of a perturbed copy by taking into
account an adversary's power in reconstructing the
original data. We define the privacy of $Y$ with respect
to $X$ to be $\mathcal{D}(X, \hat{X}(Y))$, i.e., the distortion between
$X$ and the LLSE estimate $\hat{X}(Y)$. A larger distortion
hides the original values better (and thus preserves
more privacy), so we say that a perturbed copy $Y_2$
preserves more privacy than $Y_1$ with respect to $X$ if and
only if $\mathcal{D}(X, \hat{X}(Y_2)) > \mathcal{D}(X, \hat{X}(Y_1))$.

3.3.3 Privacy under Multi-level Trust Setting 

We now define privacy for the multi-level trust case 
in the same spirit of the single-level trust case. 

For a vector $Y = [Y_1^T, \ldots, Y_M^T]^T$ of $M$ perturbed
copies of $X$, the privacy of $Y$ represents how well the
true values of $X$ are hidden in the multiple perturbed
copies $Y$. The privacy of $Y$, with respect to $X$, is
defined as $\mathcal{D}(X, \hat{X}(Y))$, the distortion between $X$ and
its LLSE estimate $\hat{X}(Y)$.

3.4 Privacy Goal and Design Space

In an MLT-PPDM setting, a data owner releases distinctly
perturbed copies of its data to multiple data
miners. One key goal of the data owner is to control
the amount of information about its data that adversaries
may derive.

We assume that the data owner wants to distribute
a total of $M$ different perturbed copies of its data, i.e.,
$Y_i$ ($1 \le i \le M$), each for a trust level $i$. The assumption
of $M$ copies is for ease of analysis. It will become clear later
that our solution of the on-demand generation allows
a data owner to generate as many different copies as
it wishes.

The data owner can easily control the amount of
information about its data that an attacker may infer
from a single perturbed copy. Utilizing Equation (8), we
express the privacy of $Y_i$, i.e., $\mathcal{D}(X, \hat{X}(Y_i))$, as follows:

$$\begin{aligned} \mathcal{D}(X, \hat{X}(Y_i)) &= \frac{1}{N}\mathrm{Tr}\left(E\left[\left(\hat{X}(Y_i) - X\right)\left(\hat{X}(Y_i) - X\right)^T\right]\right) \\ &= \frac{\sigma_{Z_i}^2}{1 + \sigma_{Z_i}^2} \cdot \frac{1}{N}\mathrm{Tr}(K_X), \end{aligned} \tag{9}$$

where $\mathrm{Tr}(\cdot)$ represents the trace of a matrix.

The data owner can easily control the privacy of an
individual copy $Y_i$ by setting $\sigma_{Z_i}^2$ according to trust
level $i$ through a one-to-one mapping. Defining trust
levels and such mappings is beyond the scope of this
paper.

However, such control alone is not sufficient in the 
face of diversity attacks. Adversaries that can access 
copies at different trust levels enjoy the diversity 
gain when they combine multiple distinctly perturbed 
copies to estimate the original data. We discuss one
such case in Section 4.2.1.

Ideally, the amount of information about X that 
adversaries can jointly infer from multiple perturbed 
copies should be no more than that of the best effort 
using any individual copy. 

Formally, we say the privacy goal is achieved with
respect to $M$ perturbed copies $Y_i$, $1 \le i \le M$, if the
following statement holds. For an arbitrary subset $Y_C$
of $\{Y_i, 1 \le i \le M\}$,

$$\mathcal{D}(X, \hat{X}(Y_C)) = \min_{Y_i \in Y_C} \mathcal{D}(X, \hat{X}(Y_i)), \tag{10}$$

where $Y_C$ is the set of perturbed copies an adversary
uses to reconstruct the original data.
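The privacy goal in Equation (10) can be tested by brute force over all subsets of copies when $M$ is small. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the noise covariance is passed as one $(N \cdot M) \times (N \cdot M)$ matrix, and the test case uses independently generated noise, which does not meet the goal.

```python
from itertools import combinations

import numpy as np


def distortion(K_X, K_Z_sub):
    """Average LLSE error for the copies whose joint noise covariance is K_Z_sub."""
    N = K_X.shape[0]
    M = K_Z_sub.shape[0] // N
    H = np.kron(np.ones((M, 1)), np.eye(N))
    K_Y = H @ K_X @ H.T + K_Z_sub
    err = K_X - K_X @ H.T @ np.linalg.solve(K_Y, H @ K_X)
    return np.trace(err) / N


def privacy_goal_holds(K_X, K_Z, M, tol=1e-9):
    """Check Equation (10) for every non-empty subset of the M copies."""
    N = K_X.shape[0]

    def rows(i):  # index range of copy i inside K_Z
        return list(range(i * N, (i + 1) * N))

    single = [distortion(K_X, K_Z[np.ix_(rows(i), rows(i))]) for i in range(M)]
    for size in range(1, M + 1):
        for subset in combinations(range(M), size):
            idx = [r for i in subset for r in rows(i)]
            joint = distortion(K_X, K_Z[np.ix_(idx, idx)])
            if abs(joint - min(single[i] for i in subset)) > tol:
                return False
    return True


K_X = np.array([[1.0]])                     # single attribute, unit variance
sigmas2 = np.array([1.0, 4.0])              # assumed perturbation magnitudes
K_Z_indep = np.kron(np.diag(sigmas2), K_X)  # independently generated noise
goal_indep = privacy_goal_holds(K_X, K_Z_indep, M=2)
print(goal_indep)
```

The check enumerates $2^M - 1$ subsets, so it is only practical as a sanity test for a handful of trust levels.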

Intuitively, achieving the privacy goal requires that 
given the copy with the least privacy among any 
subset of these M perturbed copies, the remaining 
copies in that subset contain no extra information 
about X. 

To achieve this goal, the available design space is the
noise $Z$. We have already determined that the individual noise
$Z_i$, $1 \le i \le M$, must follow $N(0, \sigma_{Z_i}^2 K_X)$. In the
rest of the paper, we show that by properly correlating
the noise $Z_i$, $1 \le i \le M$, the desired privacy goal can be
achieved.

4 Case Study 

In this section, we study a basic case corresponding to
the motivating example we described at the beginning
of Section 1. In this case, a data miner has access to two
differently perturbed copies of the same data, each for
a different trust level. We present the challenges in
achieving the privacy goal in Equation (10) with two
false starts. As we develop a solution to this basic
case, we show the key ideas in solving the more
general case of arbitrarily fine granularity of trust
levels.

4.1 An Illustrative Case 

For ease of illustration, we assume single attribute
data. We assume that the data owner has already
distributed a perturbed copy $Y_2$ of the original data
$X$ where

$$Y_2 = X + Z_2.$$

Denote the variance of $X$ as $\sigma_X^2$, and let the Gaussian
noise $Z_2 \sim N(0, \sigma_{Z_2}^2 \sigma_X^2)$ be independent of $X$.

The data owner now wishes to produce another
perturbed copy $Y_1$. It generates Gaussian noise
$Z_1 \sim N(0, \sigma_{Z_1}^2 \sigma_X^2)$, and adds it to $X$ to obtain $Y_1$ as

$$Y_1 = X + Z_1.$$

The new noise $Z_1$ is also independent of $X$ (but could
be designed to be correlated with $Z_2$). We consider the
case where the data owner chooses $\sigma_{Z_2}^2 > \sigma_{Z_1}^2$ so that
$Y_1$ is less perturbed than $Y_2$.

The privacy goal in Equation (10) requires that

$$\mathcal{D}(X, \hat{X}(Y_1, Y_2)) = \mathcal{D}(X, \hat{X}(Y_1)). \tag{11}$$

To see this, note that $\min(\mathcal{D}(X, \hat{X}(Y_1)), \mathcal{D}(X, \hat{X}(Y_2)))$
can be simplified to $\mathcal{D}(X, \hat{X}(Y_1))$, i.e., the less perturbed
copy gives a better estimate.



4.2 Two False Starts 

In this section, we illustrate the challenges in achiev- 
ing the privacy goal with two false starts. 

4.2. 1 Independent Noise 

The first intuitive attempt is to generate the two
perturbed copies independently. The added noise in
the two perturbed copies is not only independent of
the original data, but also independent of each other.

In the case we consider, the above solution generates
$Z_1$ to be independent of both $X$ and $Z_2$.
Consequently, adversaries have two perturbed copies
as follows:

$$\begin{cases} Y_1 = X + Z_1 \\ Y_2 = X + Z_2 \end{cases}$$

where $X$, $Z_1$ and $Z_2$ are mutually independent. The
adversaries perform a joint LLSE estimation to obtain
$\hat{X}(Y_1, Y_2)$. Straightforward computation utilizing
Equation (7) shows that

$$\mathcal{D}(X, \hat{X}(Y_1, Y_2)) = \frac{\sigma_X^2}{1 + 1/\sigma_{Z_1}^2 + 1/\sigma_{Z_2}^2}.$$

This value is strictly smaller than the error of the
estimate based on either $Y_1$ or $Y_2$, which is, for $i = 1, 2$,

$$\mathcal{D}(X, \hat{X}(Y_i)) = \frac{\sigma_{Z_i}^2}{1 + \sigma_{Z_i}^2}\sigma_X^2,$$

following Equation (8). Thus, Equation (11) is not satisfied
and the desired privacy goal is not achieved.

Example. Assume that the original dataset has single
attribute data $X$ with mean $\mu_X = 10$ and variance
$\sigma_X^2 = 1$. The data owner releases perturbed copies
$Y_1 = X + Z_1$ and $Y_2 = X + Z_2$ of two (sensitive)
values $X = [9, 11]^T$ to Alice and Bob with different
trust levels $\sigma_1^2 = 1$ and $\sigma_2^2 = 4$, respectively.

Alice reconstructs the data values using Eqn. (4),
and obtains $\hat{X}(Y_1) = [9.5, 10.5]^T + 0.5 Z_1$. The average
estimation error is

$$\frac{1}{2} E[(X - \hat{X})^T (X - \hat{X})] = 0.125\, E[Z_1^T Z_1] + 0.25 = 0.5.$$

Bob reconstructs the data values using Eqn. (4),
and obtains $\hat{X}(Y_2) = [9.8, 10.2]^T + 0.2 Z_2$. The average
estimation error is

$$\frac{1}{2} E[(X - \hat{X})^T (X - \hat{X})] = 0.02\, E[Z_2^T Z_2] + 0.64 = 0.8.$$

Assume that $Y_1$ and $Y_2$ are generated independently.
The reconstructed data after the collusion
between Alice and Bob using Eqn. (4) are
$\hat{X}(Y) = [85, 95]^T/9 + 4 Z_1/9 + Z_2/9$. The average estimation
error is

$$\frac{1}{2} E[(X - \hat{X})^T (X - \hat{X})] = \frac{8}{81} E[Z_1^T Z_1] + \frac{1}{162} E[Z_2^T Z_2] + \frac{16}{81} = \frac{4}{9}.$$

Thus the collusion results in a smaller error. □

Intuitively, this is because the two copies of the data
are generated independently, each containing some
innovative information of the original data that is
absent from the other. When estimation is performed
jointly, the innovative information from both copies
can be utilized, resulting in a smaller estimation error
and thus a more accurate estimate.
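The three error values in the example can be checked with exact arithmetic. The following is a minimal sketch, not part of the original scheme: the function names are hypothetical, and the two closed-form expressions are the single-attribute error formulas quoted earlier in this subsection.

```python
from fractions import Fraction


def single_copy_error(sigma2, var_x=Fraction(1)):
    """LLSE error from one copy Y = X + Z with Var(Z) = sigma2 * var_x."""
    return sigma2 * var_x / (1 + sigma2)


def joint_error_independent(sigma2_list, var_x=Fraction(1)):
    """LLSE error from several copies whose noises are mutually independent."""
    return var_x / (1 + sum(Fraction(1) / s for s in sigma2_list))


alice = single_copy_error(Fraction(1))   # sigma_1^2 = 1
bob = single_copy_error(Fraction(4))     # sigma_2^2 = 4
both = joint_error_independent([Fraction(1), Fraction(4)])
print(alice, bob, both)  # 1/2 4/5 4/9
```

The joint error 4/9 is below both individual errors, which is exactly the diversity gain the example describes.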

4.2.2 Linearly Dependent Noise 

In light of the incorrectness of the first solution, one
might consider a second approach to generate new
noise so that it is linearly dependent on the existing
one.

In the case we consider, the above approach may
generate $Z_1 = \frac{\sigma_1}{\sigma_2} Z_2$. It is easy to verify that
$Z_1 \sim N(0, \sigma_1^2 \sigma_X^2)$. However, $Y_1 = X + Z_1$ again fails to
achieve the privacy goal.

To see this, notice that the adversaries who have
access to both copies can reconstruct $X$ perfectly as
follows:

$$X = \frac{\sigma_2 Y_1 - \sigma_1 Y_2}{\sigma_2 - \sigma_1} = \frac{\sigma_2 (X + Z_1) - \sigma_1 (X + Z_2)}{\sigma_2 - \sigma_1}.$$

The estimation error is zero, and Equation (11) is not
satisfied.
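The cancellation can be seen numerically. This is an illustrative sketch only; the concrete values of `sigma_x`, `sigma1`, `sigma2` and `x` are assumptions chosen for the demonstration.

```python
import random

rng = random.Random(0)          # fixed seed so the run is repeatable
sigma_x = 1.0                   # standard deviation of X (assumed value)
sigma1, sigma2 = 1.0, 2.0       # noise scale factors, sigma1 < sigma2 (assumed)

x = 9.0                                   # one sensitive value
z2 = rng.gauss(0.0, sigma2 * sigma_x)     # noise in the existing copy Y2
z1 = (sigma1 / sigma2) * z2               # linearly dependent new noise
y1, y2 = x + z1, x + z2                   # the two released copies

# The attacker needs only sigma1 and sigma2 to cancel the noise.
x_rec = (sigma2 * y1 - sigma1 * y2) / (sigma2 - sigma1)
print(abs(x_rec - x) < 1e-9)  # True
```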

4.3 Proposed Solution 

Intuitively, Equation (11) requires that given $Y_1$,
observing the more perturbed $Y_2$ does not improve the
estimation accuracy.

One way to satisfy Equation (11) is to generate $Z_1$ so
that $Y_1 = X + Z_1$ and $Z_2 - Z_1$ are independent. To see
why, we rewrite $Y_2$ as

$$Y_2 = Y_1 + (Z_2 - Z_1). \quad (12)$$

If $Y_1$ and $Z_2 - Z_1$ are independent, then $Y_2$ is nothing
but a perturbed observation of $Y_1$. All information in
$Y_2$ useful for estimating $X$ is inherited from $Y_1$.
Consequently, given $Y_1$, $Y_2$ provides no extra innovative
information to improve the estimation accuracy, and
Equation (11) is satisfied.

Since $X$ and $Z_1$ (resp. $Z_2$) are independent, $Y_1$
and $Z_2 - Z_1$ are independent if $Z_1$ and $Z_2 - Z_1$ are
independent. The following theorem gives a sufficient
and necessary condition for $Z_1$ and $Z_2$ to satisfy that
$Z_1$ and $Z_2 - Z_1$ are independent.

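Theorem 1: Assume $Z_1 \sim N(0, \sigma_1^2 \sigma_X^2)$, $Z_2 \sim N(0, \sigma_2^2 \sigma_X^2)$,
and $\sigma_1^2 < \sigma_2^2$. $Z_1$ and $Z_2 - Z_1$ are independent
if and only if $Z_1$ and $Z_2$ are jointly Gaussian
and their covariance matrix is

$$\begin{bmatrix} \sigma_1^2 \sigma_X^2 & \sigma_1^2 \sigma_X^2 \\ \sigma_1^2 \sigma_X^2 & \sigma_2^2 \sigma_X^2 \end{bmatrix}. \quad (13)$$

Proof: Refer to Appendix B. □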
The following theorem states that $Z_1$ and $Z_2 - Z_1$
being independent is a sufficient condition for
Equation (11) to hold.

Theorem 2: Given that $Z_1 \sim N(0, \sigma_1^2 \sigma_X^2)$ and
$Z_2 \sim N(0, \sigma_2^2 \sigma_X^2)$, and $\sigma_1^2 < \sigma_2^2$, if $Z_1$ and $Z_2 - Z_1$ are
independent, then Equation (11) holds.

Proof: Refer to Appendix C. □

Example. We now revisit the example in Section 4.2.1
to show that collusion does not improve estimation
accuracy in our scheme. Assume that $Y_1$ and
$Y_2$ are generated following the proposed solution, i.e.,
$Z_1$ and $Z_2$ are jointly Gaussian and their covariance
matrix is $\begin{bmatrix} 1 & 1 \\ 1 & 4 \end{bmatrix}$. The reconstructed data after the
collusion between Alice and Bob using Eqn. (4) are
$\hat{X}(Y) = [9.5, 10.5]^T + 0.5 Z_1$. The average estimation
error is

$$\frac{1}{2} E[(X - \hat{X})^T (X - \hat{X})] = 0.125\, E[Z_1^T Z_1] + 0.25 = 0.5.$$

This error of joint estimation is the same as the error of
estimation using only the least perturbed copy. Thus
the collusion does not result in a smaller error in our
scheme. □

Remark: Intuitively, since $Y_2$ is a perturbed
observation of $Y_1$ as shown in Equation (12), $Y_2$ cannot
provide extra innovative information to improve the
estimation accuracy achieved by utilizing only $Y_1$, and
Equation (11) is satisfied.

This sufficient condition is key in achieving the
privacy goal in this simple case, as well as in the
general cases, on which we elaborate in Section 5.

Following the above analysis, our solution to this
simple case is as follows:

• Given $\sigma_1^2$ and $\sigma_2^2$, construct the covariance matrix
of $Z_1$ and $Z_2$ as in Equation (13). Derive the joint
distribution of $Z_1$ and $Z_2$.

• Compute the conditional distribution of $Z_1$ given
$Z_2$. Generate $Z_1$ according to this conditional
distribution.

• Generate the desired $Y_1 = X + Z_1$.

In this way, $Z_1$ and $Z_2 - Z_1$ are guaranteed to be
independent; hence, Equation (11) is satisfied.
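The three steps above can be sketched for the single-attribute case as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names are hypothetical, the conditional mean and variance are the standard ones for a bivariate Gaussian with the covariance of Equation (13), and the joint LLSE error is computed by inverting the $2 \times 2$ covariance of $(Y_1, Y_2)$ by hand.

```python
import math
import random
from fractions import Fraction as F


def conditional_params(s1, s2, var_x):
    """Return (a, w): Z1 | Z2 = z2 is Gaussian with mean a*z2 and variance w,
    for the covariance of Equation (13), with s1 = sigma_1^2 < s2 = sigma_2^2."""
    a = s1 / s2
    w = s1 * var_x * (1 - a)
    return a, w


def sample_z1(z2, s1, s2, var_x, rng):
    """Draw Z1 from its conditional distribution given the released Z2."""
    a, w = conditional_params(s1, s2, var_x)
    return float(a) * z2 + rng.gauss(0.0, math.sqrt(w))


def joint_llse_error(k11, k12, k22, var_x):
    """LLSE error of X from (Y1, Y2) when the noise covariance is
    [[k11, k12], [k12, k22]] and the noise is independent of X."""
    a, b, d = var_x + k11, var_x + k12, var_x + k22   # entries of K_Y
    det = a * d - b * b
    # K_XY = [var_x, var_x]; error = var_x - K_XY K_Y^{-1} K_XY^T
    return var_x - var_x * var_x * (d - 2 * b + a) / det


s1, s2, var_x = F(1), F(4), F(1)          # values of the running example
a, w = conditional_params(s1, s2, var_x)
var_z1 = a * a * s2 * var_x + w           # variance of a*Z2 + W
cov_z1_z2 = a * s2 * var_x                # covariance of Z1 and Z2
cov_z1_diff = cov_z1_z2 - var_z1          # Cov(Z1, Z2 - Z1)
correlated = joint_llse_error(var_z1, cov_z1_z2, s2 * var_x, var_x)
independent = joint_llse_error(s1 * var_x, F(0), s2 * var_x, var_x)
z1 = sample_z1(0.8, s1, s2, var_x, random.Random(0))   # one draw, given z2 = 0.8
print(cov_z1_diff, correlated, independent)  # 0 1/2 4/9
```

With the correlated noise the joint error equals Alice's single-copy error of 0.5, while independent noise gives the smaller value 4/9.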



5 Solution to General Cases 

We now show that the solutions to the general cases
of arbitrarily fine trust levels follow naturally from
that to the two trust level case studied in Section 4.



5.1 Shaping the Noise 

5.1.1 Independent Noise Revisited

In Section 4, we show that adding independent noise
to generate two differently perturbed copies, although 
convenient, fails to achieve our privacy goal. The 
increase in the number of independently generated 
copies aggravates the situation; the estimation error 
actually goes to zero as this number increases indefi- 
nitely. In turn, the attackers can perfectly reconstruct 
the original data. We formalize this observation in the
following theorem.

Theorem 3: Let $Y = [Y_1^T, \ldots, Y_M^T]^T$ be a vector
containing $M$ perturbed copies. Assume that $Y$ is
generated from the original data $X$ as follows:

$$Y = HX + Z,$$

where $H = [I_N, \ldots, I_N]^T$, and $Z = [Z_1^T, \ldots, Z_M^T]^T$
with $Z_i \sim N(0, \sigma_{Z_i}^2 K_X)$ is the noise vector.

If the noises $Z_i$, $1 \le i \le M$, are mutually independent,
then the square errors between $X$ and the LLSE estimate
$\hat{X}(Y)$ are the diagonal terms of the following
matrix:

$$\left(1 + \sum_{i=1}^{M} \frac{1}{\sigma_{Z_i}^2}\right)^{-1} K_X.$$

As $M$ increases, the estimation errors decrease, and so
does the distortion $\mathcal{D}(X, \hat{X}(Y))$.

Proof: Refer to Appendix D. □

Remark: The theorem says that when adding a
new copy that is perturbed by independent noise, the
estimation error decreases. It agrees with the intuition
that a new independently-perturbed copy adds extra
innovative information to improve the estimation
accuracy.

We conclude that the noises $Z_i$, $1 \le i \le M$, should not
be generated independently.

5.1.2 Properly Correlated Noise

We show by the case study that the key to achieving
the desired privacy goal is to have the noises $Z_i$, $1 \le i \le M$,
properly correlated. To this end, we further develop
the pattern found in the $2 \times 2$ noise covariance
matrix in Equation (13) into a corner-wave property
for a multi-dimensional noise covariance matrix.
This property becomes the cornerstone of Theorem 4,
which is a generalization of Theorems 1 and 2.

Corner-wave Property. Theorem 4 states that for $M$
perturbed copies, the privacy goal in Equation (10)
is achieved if the noise covariance matrix $K_Z$ has
the corner-wave pattern as shown in Equation (15).
Specifically, we say that an $M \times M$ square matrix has
the corner-wave property if, for every $i$ from 1 to $M$,
the following entries have the same value as the $(i, i)$-th
entry:

• all entries to the right of the $(i, i)$-th entry in row $i$,

• all entries below the $(i, i)$-th entry in column $i$.

The distribution of the entries in such a matrix looks
like corner-waves originated from the lower right
corner.
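The definition can be made concrete with a small sketch. The helper names below are hypothetical; the code builds the scalar pattern only (in Equation (15) each entry additionally multiplies $K_X$) and checks the two bullet conditions literally.

```python
def corner_wave(values):
    """Build the M x M matrix whose (i, j) entry is values[min(i, j)]."""
    m = len(values)
    return [[values[min(i, j)] for j in range(m)] for i in range(m)]


def has_corner_wave(mat):
    """Check that every entry to the right of, or below, the (i, i) entry
    equals the (i, i) entry."""
    m = len(mat)
    return all(mat[i][j] == mat[i][i] and mat[j][i] == mat[i][i]
               for i in range(m) for j in range(i + 1, m))


k = corner_wave([1, 4, 9])
print(k)                                   # [[1, 1, 1], [1, 4, 4], [1, 4, 9]]
print(has_corner_wave(k))                  # True
print(has_corner_wave([[1, 0], [0, 4]]))   # False
```

The last call shows that a diagonal matrix, which corresponds to independent noise, does not have the property.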

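Theorem 4: Let $Y = [Y_1^T, \ldots, Y_M^T]^T$ represent an
arbitrary number of perturbed copies. Assume that
$Y$ is generated from the original data $X$ as follows:

$$Y = HX + Z,$$

where $H = [I_N, \ldots, I_N]^T$, and $Z = [Z_1^T, \ldots, Z_M^T]^T$
with $Z_i \sim N(0, \sigma_{Z_i}^2 K_X)$ is the noise vector. Without
loss of generality, we further assume

$$\sigma_{Z_i}^2 < \sigma_{Z_{i+1}}^2, \quad i = 1, \ldots, M - 1. \quad (14)$$

Then the following equation holds

$$\mathcal{D}(X, \hat{X}(Y)) = \min_{i = 1, \ldots, M} \mathcal{D}(X, \hat{X}(Y_i)) = \frac{\sigma_{Z_1}^2}{\sigma_{Z_1}^2 + 1} \cdot \frac{\mathrm{Tr}(K_X)}{N}$$

if $Z$ is a jointly Gaussian vector and its covariance
matrix $K_Z$ is given by

$$K_Z = \begin{bmatrix}
\sigma_{Z_1}^2 K_X & \sigma_{Z_1}^2 K_X & \cdots & \sigma_{Z_1}^2 K_X \\
\sigma_{Z_1}^2 K_X & \sigma_{Z_2}^2 K_X & \cdots & \sigma_{Z_2}^2 K_X \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{Z_1}^2 K_X & \sigma_{Z_2}^2 K_X & \cdots & \sigma_{Z_M}^2 K_X
\end{bmatrix}. \quad (15)$$

Proof: Refer to Appendix E. □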

Remark: The corner-wave property of $K_Z$ given in
Equation (15) guarantees that Equation (10) holds.
Therefore, the diversity attack does not help to improve the
estimation accuracy.

Moreover, for any subset of these $M$ perturbed
copies, the covariance matrix of the corresponding
noise also has the corner-wave property, and thus the
privacy goal is achieved. We summarize this observation
in Corollary 1.

Corollary 1: If the privacy goal in Equation (10)
is achieved with respect to $M$ perturbed data
$Y_1, \ldots, Y_M$, then the goal is also achieved with respect
to any subset of $\{Y_1, \ldots, Y_M\}$.

Based on Theorem 4 and Corollary 1, one way to
achieve the privacy goal in Equation (10) is to ensure
that noise $Z$ is a jointly Gaussian vector and follows
$N(0, K_Z)$, where $K_Z$ is given by Equation (15). We
consider two scenarios when generating noise $Z$ and
the corresponding perturbed copies $Y$. We discuss
these two scenarios in the following two sections.

5.2 Batch Generation 

In the first scenario, the data owner determines the
$M$ trust levels a priori, and generates $M$ perturbed
copies of the data in one batch. In this case, all trust
levels are predefined and $\sigma_{Z_1}^2$ to $\sigma_{Z_M}^2$ are given when
generating the noise. We refer to this scenario as the
batch generation.

We propose two batch algorithms. Algorithm 1
generates noise $Z_1$ to $Z_M$ in parallel, while Algorithm
2 generates them sequentially.

5.2.1 Algorithm 1: Parallel Generation

Without loss of generality, we assume $\sigma_{Z_i}^2 < \sigma_{Z_{i+1}}^2$,
where $1 \le i \le M - 1$. Algorithm 1 generates the
components of noise $Z$, i.e., $Z_1$ to $Z_M$, simultaneously
based on the following probability density function,
for any real $(N \cdot M)$-dimensional vector $v$,

$$f_Z(v) = \frac{1}{\sqrt{(2\pi)^{NM} \det(K_Z)}} \exp\left(-\frac{1}{2} v^T K_Z^{-1} v\right), \quad (16)$$

where $K_Z$ is given by Equation (15).

Algorithm 1 then constructs $Y$ as $HX + Z$ and outputs
it. We refer to Algorithm 1 as parallel generation.



Algorithm 1: Parallel Generation

1: // Input: $X$, $K_X$, and $\sigma_{Z_1}^2$ to $\sigma_{Z_M}^2$
2: // Output: $Y$
3: Construct $K_Z$ with $K_X$ and $\sigma_{Z_1}^2$ to $\sigma_{Z_M}^2$, according to Equation (15)
4: Generate $Z$ with $K_Z$, according to Equation (16)
5: Generate $Y = HX + Z$
6: Output $Y$



Algorithm 1 serves as a baseline algorithm for the
next two algorithms.
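Algorithm 1 can be sketched in a few lines with NumPy. This is only an illustration: the function names and the example values of `k_x`, `x` and the perturbation magnitudes are assumptions, and `numpy.random.Generator.multivariate_normal` stands in for "generate $Z$ according to Equation (16)".

```python
import numpy as np


def corner_wave_covariance(sigma2, k_x):
    """K_Z of Equation (15): block (i, j) equals sigma2[min(i, j)] * K_X."""
    sigma2 = np.asarray(sigma2, dtype=float)        # assumed sorted ascending
    scalar_part = np.minimum.outer(sigma2, sigma2)  # min(s_i, s_j) = s_min(i,j)
    return np.kron(scalar_part, k_x)


def parallel_generation(x, k_x, sigma2, rng):
    """Draw Z ~ N(0, K_Z) in one shot and return the M copies as rows."""
    m, n = len(sigma2), len(x)
    k_z = corner_wave_covariance(sigma2, k_x)
    z = rng.multivariate_normal(np.zeros(m * n), k_z)
    return np.tile(x, m).reshape(m, n) + z.reshape(m, n)   # Y = HX + Z


k_x = np.array([[2.0, 0.5], [0.5, 1.0]])   # illustrative K_X (N = 2)
x = np.array([9.0, 11.0])                  # one N-dimensional record
copies = parallel_generation(x, k_x, [0.25, 0.5, 1.0], np.random.default_rng(0))
print(copies.shape)  # (3, 2)
```

Note that the full $NM \times NM$ matrix $K_Z$ is materialized here, which is the memory cost that motivates the sequential variant.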

5.2.2 Algorithm 2: Sequential Generation

The large memory requirement of Algorithm 1 motivates
us to seek a memory-efficient solution.
Instead of generating the noise in parallel, we generate
$Z_1$ to $Z_M$ sequentially, each of which is a Gaussian vector of
dimension $N$. The validity of this alternative procedure
is based on the insight in the following theorem.

Theorem 5: Consider $Z = [Z_1^T, \ldots, Z_M^T]^T$ where
$Z_i \sim N(0, K_{Z_i})$ with $K_{Z_i} = \sigma_{Z_i}^2 K_X$. Without loss of
generality, further assume

$$\sigma_{Z_i}^2 < \sigma_{Z_{i+1}}^2, \quad i = 1, \ldots, M - 1.$$

Then $Z$ is a jointly Gaussian vector and $K_Z$ has
the form in Equation (15) if and only if $Z_1$ and
$(Z_i - Z_{i-1})$, $i = 2, \ldots, M$, are mutually independent.

Proof: Refer to Appendix F. □

Based on Theorem 5, Algorithm 2 sequentially generates
$M$ independent noise terms: $Z_1$, and $(Z_i - Z_{i-1})$ for $i$
from 2 to $M$. Noise $Z_i$ is then simply $(Z_i - Z_{i-1}) + Z_{i-1}$
for $i$ from 2 to $M$. Finally, Algorithm 2 generates
the perturbed copies $Y_1$ to $Y_M$ by adding the corresponding
noise. We refer to Algorithm 2 as sequential
generation.



Algorithm 2: Sequential Generation

1: // Input: $X$, $K_X$, and $\sigma_{Z_1}^2$ to $\sigma_{Z_M}^2$
2: // Output: $Y_1$ to $Y_M$
3: Construct $Z_1 \sim N(0, \sigma_{Z_1}^2 K_X)$
4: Generate $Y_1 = X + Z_1$
5: Output $Y_1$
6: for $i$ from 2 to $M$ do
7: Construct noise $\xi \sim N(0, (\sigma_{Z_i}^2 - \sigma_{Z_{i-1}}^2) K_X)$
8: Generate $Y_i = Y_{i-1} + \xi$
9: Output $Y_i$
10: end for



We now explain intuitively why the mutual independence
requirement for $Z_1$ and $(Z_i - Z_{i-1})$, for $i$
from 2 to $M$, is sufficient to achieve our privacy goal
in Equation (10).

We rewrite $Y_i$ as $X + Z_1 + \sum_{j=2}^{i} (Z_j - Z_{j-1})$. Since
$X$, $Z_1$ and $Z_j - Z_{j-1}$ for $j = 2, \ldots, M$ are mutually
independent, $Y_i$, $2 \le i \le M$, are perturbed observations
of $Y_1$. Intuitively, all information in them that is
useful for estimating $X$ is inherited from $Y_1$. As such,
given $Y_1$, $Y_i$, $2 \le i \le M$, provides no extra innovative
information to improve the estimation accuracy. Similar
analysis applies to any subset of $Y_1$ to $Y_M$. Hence,
Equation (10) is satisfied. This intuition is similar to the
explanation for the case study in Section 4.
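A minimal NumPy sketch of the sequential procedure follows. The function names and example values are assumptions made for illustration. The second helper verifies, without sampling, that cumulative sums of independent increments with variances $\sigma_{Z_i}^2 - \sigma_{Z_{i-1}}^2$ reproduce the scalar corner-wave pattern of Equation (15).

```python
import numpy as np


def sequential_generation(x, k_x, sigma2, rng):
    """Yield Y_1, ..., Y_M one at a time; sigma2 must be sorted ascending."""
    y = np.asarray(x, dtype=float)
    previous = 0.0
    for s in sigma2:
        # Independent increment with covariance (s - previous) * K_X.
        step = rng.multivariate_normal(np.zeros(len(y)), (s - previous) * k_x)
        y = y + step            # Y_i = Y_{i-1} + increment (Y_0 = X)
        previous = s
        yield y


def implied_scalar_covariance(sigma2):
    """Covariance pattern of the cumulative sums of the independent increments."""
    increments = np.diff(np.asarray(sigma2, dtype=float), prepend=0.0)
    lower = np.tril(np.ones((len(sigma2), len(sigma2))))  # Z_i = first i increments
    return lower @ np.diag(increments) @ lower.T


sigma2 = [0.25, 0.5, 1.0]
k_x = np.array([[2.0, 0.5], [0.5, 1.0]])   # illustrative K_X
copies = list(sequential_generation(np.array([9.0, 11.0]), k_x, sigma2,
                                    np.random.default_rng(0)))
pattern_ok = np.allclose(implied_scalar_covariance(sigma2),
                         np.minimum.outer(sigma2, sigma2))
print(len(copies), pattern_ok)  # 3 True
```

Only one $N \times N$ covariance is held at a time, which is consistent with the lower space requirement claimed for Algorithm 2.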

5.2.3 Disadvantages 

The main disadvantage of the batch generation ap- 
proach is that it requires a data owner to foresee all 
possible trust levels a priori. 

This requirement is inflexible and
sometimes impossible to meet. One such scenario
arises in our case study. After the data
owner has already released a perturbed copy $Y_2$, a new
request for a less distorted copy $Y_1$ arrives. The
sequential generation algorithm cannot handle such
requests since the trust level of the new request (its
perturbation magnitude) is lower than the existing one. In today's ever changing
world, it is desirable to have technologies that adapt
to the dynamics of the society. In our problem setting,
generating new perturbed copies on demand would
be a desirable feature.

5.3 On Demand Generation 

As opposed to the batch generation, new perturbed
copies are introduced on demand in this second
scenario. Since the requests may be arbitrary, the
trust levels corresponding to the new copies would
be arbitrary as well. The new trust levels can be either
lower or higher than the existing trust levels. We refer
to this scenario as on-demand generation. Achieving the
privacy goal in this scenario will give data owners
the maximum flexibility in providing MLT-PPDM services.

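We assume $L$ ($L < M$) existing copies $Y_1$ to $Y_L$.
We also assume that the data owner, upon requests,
generates additional $M - L$ copies $Y_{L+1}$ to $Y_M$. Thus
there will be $M$ copies in total. Note that in this subsection
$\sigma_{Z_1}^2$ to $\sigma_{Z_M}^2$ can be in any order. Finally, we define
vectors $Z'$ and $Z''$ as

$$Z' = \begin{bmatrix} Z_1 \\ \vdots \\ Z_L \end{bmatrix} \quad \text{and} \quad Z'' = \begin{bmatrix} Z_{L+1} \\ \vdots \\ Z_M \end{bmatrix}.$$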
According to Theorem 4, the data owner should
generate new noise $Z''$ in such a way that the covariance
matrix of $Z = [Z'^T, Z''^T]^T$ has the corner-wave
property, and $Z'$ and $Z''$ are jointly Gaussian.

The desired covariance matrix $K_Z$ can be constructed
according to Equation (15) (after properly ordering
$Z_1$ to $Z_M$ according to $\sigma_{Z_1}^2$ to $\sigma_{Z_M}^2$).

According to Section 2.1, it is sufficient and necessary
for the conditional distribution of $Z''$, given that
$Z'$ takes any value $v_1$, to be a Gaussian with mean

$$K_{Z''Z'} K_{Z'}^{-1} v_1 \quad (17)$$

and covariance

$$K_{Z''} - K_{Z''Z'} K_{Z'}^{-1} K_{Z''Z'}^T, \quad (18)$$

where $K_{Z'}$ is the covariance matrix of $Z'$, $K_{Z''Z'}$ is
the desired covariance matrix between $Z''$ and $Z'$, and
$K_{Z''}$ is the desired covariance matrix of $Z''$.

Note that $K_{Z'}$ is known to the data owner, and $K_{Z''Z'}$
and $K_{Z''}$ can be extracted from the desired covariance
matrix $K_Z$. We turn the above analysis into Algorithm 3.

Algorithm 3: On Demand Generation

1: // Input: $X$, $K_X$, $\sigma_{Z_1}^2$ to $\sigma_{Z_M}^2$, and values of $Z'$: $v_1$
2: // Output: New copies $Y_{L+1}$ to $Y_M$
3: Construct $K_Z$ with $K_X$ and $\sigma_{Z_1}^2$ to $\sigma_{Z_M}^2$, according to Equation (15)
4: Extract $K_{Z'}$, $K_{Z''Z'}$, and $K_{Z''}$ from $K_Z$
5: Generate $Z''$ as a Gaussian with mean and variance in Equations (17) and (18), respectively
6: for $i$ from $L + 1$ to $M$ do
7: Generate $Y_i = X + Z_i$
8: Output $Y_i$
9: end for
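The steps of Algorithm 3 can be sketched as below. This is an illustrative NumPy sketch under stated assumptions, not the authors' code: the function names are hypothetical, all perturbation magnitudes are assumed distinct so that the matrices are invertible, and the Kronecker structure of $K_Z$ is used so that Equations (17) and (18) are evaluated on the small scalar patterns (entry $\min(\sigma_{Z_i}^2, \sigma_{Z_j}^2)$) before multiplying by $K_X$.

```python
import numpy as np


def conditional_scalar_parts(s_old, s_new):
    """Scalar parts of Equations (17) and (18): returns (gain, cond) with
    K_{Z''Z'} K_{Z'}^{-1} = kron(gain, I) and the covariance = kron(cond, K_X)."""
    s_old = np.asarray(s_old, dtype=float)
    s_new = np.asarray(s_new, dtype=float)
    c_oo = np.minimum.outer(s_old, s_old)     # pattern of K_{Z'}
    c_no = np.minimum.outer(s_new, s_old)     # pattern of K_{Z''Z'}
    c_nn = np.minimum.outer(s_new, s_new)     # pattern of K_{Z''}
    gain = np.linalg.solve(c_oo, c_no.T).T    # c_no @ inv(c_oo); c_oo is symmetric
    return gain, c_nn - gain @ c_no.T


def on_demand_generation(x, k_x, s_old, z_old, s_new, rng):
    """z_old holds the released noise, one row per existing copy.
    Returns the new copies Y_{L+1}, ..., Y_M as rows."""
    gain, cond = conditional_scalar_parts(s_old, s_new)
    mean = gain @ np.asarray(z_old, dtype=float)   # Equation (17)
    cov = np.kron(cond, k_x)                       # Equation (18)
    z_new = rng.multivariate_normal(mean.ravel(), cov).reshape(mean.shape)
    return np.asarray(x, dtype=float) + z_new


k_x = np.array([[2.0, 0.5], [0.5, 1.0]])   # illustrative K_X
z_old = np.array([[0.8, -1.2]])            # noise of the one existing copy (assumed)
new_copies = on_demand_generation([9.0, 11.0], k_x, [4.0], z_old, [1.0, 2.0],
                                  np.random.default_rng(0))
print(new_copies.shape)  # (2, 2)
```

For one existing copy with magnitude 4 and one new copy with magnitude 1, `conditional_scalar_parts` returns a gain of 0.25 and a conditional variance factor of 0.75, matching the two-level case of Section 4.3.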



5.4 Time and Space Complexity 

In this subsection, we study the time and space
complexity of the three algorithms. One may notice
that all the covariance matrices of noise in the three
algorithms, such as Equation (15) and Equation (18), can
be written as the Kronecker product of two matrices.
For such covariance matrices, we have the following
observation:

Lemma 1: Assume that $\mu_G$ and $K_G$ are the mean and
covariance matrix of the jointly Gaussian random
vector $G$. If $K_G = \Sigma_G \otimes K_0$, where $\Sigma_G$ and $K_0$ are $P \times P$
and $Q \times Q$, respectively, and $K_0$ is also a covariance
matrix, then the time complexity of generating $G$ is
$O(P^3 + Q^3)$.

Proof: Refer to Appendix G.1. □

Remark: If $G$ is generated directly using $K_G$, the
complexity is $O(P^3 Q^3)$. Viewing $K_G$ as a Kronecker product
of two matrices of smaller dimensions, we can
utilize the properties of the Kronecker product to reduce
the complexity to $O(P^3 + Q^3)$.
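One plausible way to exploit the Kronecker structure is to factor the two small matrices separately. The sketch below is an assumption about how this can be done, not a transcription of the proof in Appendix G.1; the function name and the optional argument `w` (used to inject a fixed standard-normal draw) are hypothetical.

```python
import numpy as np


def sample_kron_gaussian(mu, sigma_g, k_0, rng, w=None):
    """Draw G ~ N(mu, kron(sigma_g, k_0)) using only the two small Cholesky factors."""
    a = np.linalg.cholesky(sigma_g)      # P x P factor, sigma_g = a @ a.T
    b = np.linalg.cholesky(k_0)          # Q x Q factor, k_0 = b @ b.T
    if w is None:
        w = rng.standard_normal((a.shape[0], b.shape[0]))   # i.i.d. N(0, 1)
    # Row-major vec(a @ w @ b.T) equals kron(a, b) @ vec(w),
    # whose covariance is kron(sigma_g, k_0).
    return mu + (a @ w @ b.T).ravel()


sigma_g = np.array([[1.0, 1.0, 1.0], [1.0, 4.0, 4.0], [1.0, 4.0, 9.0]])  # P = 3
k_0 = np.array([[2.0, 0.5], [0.5, 1.0]])                                 # Q = 2
w_fixed = np.arange(6.0).reshape(3, 2)
g = sample_kron_gaussian(np.zeros(6), sigma_g, k_0, None, w=w_fixed)
big = np.kron(np.linalg.cholesky(sigma_g), np.linalg.cholesky(k_0))
print(np.allclose(g, big @ w_fixed.ravel()))  # True
```

The two factorizations cost $O(P^3)$ and $O(Q^3)$, and the $PQ \times PQ$ matrix `big` is built here only to check the result.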

The proof suggests an efficient implementation of 
the proposed three algorithms. Note that for each al- 
gorithm, the time complexity may be further reduced. 

Utilizing Lemma 1, we give the following theorems
on the time and space complexity of the proposed
three algorithms.



Theorem 6: Given an $N$-dimensional data vector
$X$, the time complexity of generating $M$ perturbed
copies using Algorithm 1 is $O(N^3 + MN^2)$, and the
space complexity is $O(M + N^2)$.

Proof: Refer to Appendix G.2. □

Theorem 7: Given an $N$-dimensional data vector
$X$, the time complexity of generating $M$ perturbed
copies using Algorithm 2 is $O(N^3 + MN^2)$, and the
space complexity is $O(N^2)$.

Proof: Refer to Appendix G.3. □

Remark: Using a similar set of arguments, we can
show that the time complexity of the independent noise
scheme described in Section 5.1.1 is the same as that of
Algorithm 2.

Theorem 8: Given an $N$-dimensional data vector
$X$ and $L$ ($1 \le L \le M - 1$) perturbed copies of $X$,
the time complexity of generating $M - L$ perturbed
copies using Algorithm 3 is $O(M^3 + N^3)$, and the space
complexity is $O(M^2 + N^2)$.

Proof: Refer to Appendix G.4. □

Table 2 compares the applicability and complexity
of the three proposed algorithms. In summary, Algorithms
1 and 2 have lower space and time complexity
than Algorithm 3. Algorithm 3 offers data owners
maximum flexibility by generating perturbed copies
in an on-demand fashion.

6 Experiments

6.1 Methodology and Settings

We design two experiments, performance test (Exper- 
iment 1) and scalability test (Experiment 2). Experi- 
ment 1 explores answers to the following questions 
numerically: 

• How severe can LLSE-based diversity attacks be,
given that the perturbed copies at different trust
levels are generated independently?
• How effective is our proposed scheme against
LLSE-based diversity attacks, compared to the
above independent noise scheme?
• How does an adversary's knowledge affect the
power of such attacks?
Experiment 2 demonstrates the runtime of our pro- 
posed Algorithm 3. 

We run our experiments on a real dataset, CENSUS
[41], which is commonly used in the literature of
privacy preservation such as [42], for carrying out
the experiments and evaluating their performance in
a fully controlled manner. This dataset contains one 
million tuples with four attributes: Age, Education, 
Occupation, and Income. We take the first 10^ tuples 
and conduct the experiments on the Age and Income 
attributes. The statistics and distribution of the data
are shown in Table 3 and Figure 1, respectively.

Given data $X$ (Age and Income), to generate perturbed
copies $Y_i$ at different trust levels $i$, we generate
Gaussian noise $Z_i$ according to $N(0, \sigma_{Z_i}^2 K_X)$, and add
$Z_i$ to $X$. The constant $\sigma_{Z_i}^2$ represents the perturbation
magnitude determined by the data owner according
to the trust level $i$. The noise for different trust levels
is generated either independently, or in a properly
correlated manner following our proposed solution in
Section 5.

TABLE 2

Comparison of applicability, space complexity, and time complexity of the three proposed algorithms.

              Batch        On-demand    Space            Time
              Generation   Generation   Complexity       Complexity
Algorithm 1   ✓            -            O(M + N^2)       O(N^3 + MN^2)
Algorithm 2   ✓            -            O(N^2)           O(N^3 + MN^2)
Algorithm 3   ✓            ✓            O(M^2 + N^2)     O(M^3 + N^3)

Fig. 1. Distribution of sensitive values Age and Income.

Data miners can access one or more perturbed
copies $Y_i$, either according to the application scenario
setting or by collusion among themselves. Recall our
assumption that data miners perform joint LLSE estimation
to reconstruct $X$. We study two classes of data
miners with different knowledge about the original
data and noise:

• the first class of adversaries has perfect knowledge,
i.e., the exact values of $\mu_X$, $K_X$ and $\sigma_{Z_i}^2$ for
every trust level $i$;

• the second class of adversaries has partial knowledge,
i.e., the exact values of $\sigma_{Z_i}^2$ for every trust
level $i$, but not $\mu_X$ and $K_X$.

To perform LLSE estimation, data miners with
partial knowledge estimate $\mu_X$ and $K_X$ using their
perturbed copies. For each $Y_i$, its mean is simply $\mu_X$,
and its covariance matrix is $(1 + \sigma_{Z_i}^2) K_X$. Knowing the
exact values of $\sigma_{Z_i}^2$, a data miner can estimate $\mu_X$ and
$K_X$ using the sample mean and sample covariance
matrix of $Y_i$. The accuracy of such estimation depends on
the sample size; the larger the sample size, the more
accurate the estimation of $\mu_X$ and $K_X$.
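The estimation step of a partial-knowledge adversary can be sketched as follows. The function name and the toy samples are assumptions made for illustration; the only facts used are those stated above, namely that $Y_i$ has mean $\mu_X$ and covariance $(1 + \sigma_{Z_i}^2) K_X$.

```python
import numpy as np


def estimate_from_copy(y_samples, sigma2):
    """Estimate (mu_X, K_X) from rows of one perturbed copy with known sigma2."""
    y_samples = np.asarray(y_samples, dtype=float)
    mu_hat = y_samples.mean(axis=0)               # E[Y_i] = mu_X
    k_y_hat = np.cov(y_samples, rowvar=False)     # estimates (1 + sigma2) * K_X
    return mu_hat, k_y_hat / (1.0 + sigma2)


# Toy samples whose sample covariance is (4/3) * I, so with sigma2 = 1/3
# the recovered K_X estimate is the identity matrix.
toy = [[0, 0], [2, 0], [0, 2], [2, 2]]
mu_hat, k_hat = estimate_from_copy(toy, sigma2=1.0 / 3.0)
print(mu_hat)                              # [1. 1.]
print(np.allclose(k_hat, np.eye(2)))       # True
```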

TABLE 3

Statistics of the original data Age and Income.

          Mean $\mu_X$   Variance $\sigma_X^2$
Age       50.06          303.03
Income    16.57          219.92



In Experiment 1, we use two performance metrics:
average normalized estimation error and distribution
of estimation error. For the LLSE estimate of $X$ based
on $Y$, i.e., $\hat{X}(Y)$, we define its normalized estimation
error as

$$\frac{\mathcal{D}(X, \hat{X}(Y))}{\mathrm{Tr}(K_X)}.$$

It takes values between 0 and 1. The smaller it is,
the more accurate the LLSE estimation is. It generally
decreases as more perturbed copies are used in the
LLSE estimation. When showing the distribution of
the estimation error, we use $\sqrt{\mathcal{D}(X, \hat{X}(Y))}$ directly,
without normalization, so that one may see how large the
distortion is compared to the values of the original data
shown in Fig. 1. The distribution is represented
by a histogram as well as a cumulative histogram.
The curve of the cumulative histogram starts from 0 and
increases to 1. The faster the curve approaches 1,
i.e., the bigger the proportion of accurate estimates, the
better the LLSE-based diversity attack performs. We
conduct experiments on data with two attributes (i.e.,
$N = 2$); however, for ease of illustration, we show the
performance on different attributes separately.

6.2 Experiment 1: Performance Test

In this subsection, we show the superiority of our
scheme over the scheme that simply adds independent
noise, and how a data miner's knowledge affects
the power of LLSE-based diversity attacks. Algorithm
3 is used for the experiment due to its maximum
flexibility among the three proposed algorithms.

$M$ perturbed copies $Y_i$, $1 \le i \le M$, are generated
one by one upon requests, either by adding independent noise
to the original data or by using our proposed Algorithm
3. Each request is at a different trust level with the corresponding
$\sigma_{Z_i}^2$ randomly generated in [0.25, 1]. Figure 2
shows $\sigma_{Z_i}^2$ as a function of perturbed copy number $i$.

Fig. 2. Comparisons of average normalized estimation error of the independent noise scheme (denoted as IN)
and our scheme (denoted as Ours) on the data Age (a) and Income (b), respectively. The average normalized
estimation error of each setting is shown as a function of the number of generated perturbed copies. Note that
using our algorithm, the curve of attacks utilizing the least perturbed copy overlaps with the curve of attacks
utilizing all the available $M$ copies. Perturbation magnitude $\sigma_{Z_i}^2$ is shown as a function of perturbed copy number
$i$ at the bottom.

We assume that data miners can access all the M 
perturbed copies. This setting represents the most 
severe attack scenario where data miners jointly es- 
timate X using all the available M perturbed copies. 
Since the perturbed copies are released one by one, 
the number of the available perturbed copies also 
increases one by one. 

We also assume that data miners with partial
knowledge estimate $\mu_X$ and $K_X$ with different sample
sizes. In particular, we assume that they have $100N^2$,
$200N^2$ and $500N^2$ samples, where $N^2$ is the number
of entries in $K_X$ and $N = 2$ in our experiments.

Figures 2(a) and (b) show the normalized estimation
errors of both schemes as a function of the number
of perturbed copies, on attributes Age and Income,
respectively.

The results of the experiments clearly show that, for the
independent noise scheme, the diversity gain in joint
estimation reduces the normalized estimation error
dramatically. For our algorithm, in contrast, we find that
the estimation error drops only
when a perturbed copy with the minimum perturbation
magnitude so far becomes available. Using our algorithm,
the curve of attacks utilizing the least perturbed
copy overlaps with the curve of attacks utilizing all
the available $M$ copies. The above observations imply
that the joint estimation based on all existing copies is
only as good as the estimation based on the copy with
the minimum privacy, and there is no diversity gain
in performing the LLSE estimation jointly. Moreover,
we have verified that the estimation error matches our
analytical result in Theorem 4.

We also find that when data miners have perfect
knowledge, the normalized estimation error decreases
monotonically as $M$ increases for copies perturbed
by independent noise. This trend indicates a perfect
reconstruction of $X$ when $M$ goes to infinity. It also
confirms Theorem 3 empirically.

On the other hand, if the adversaries have to estimate
$\mu_X$ and $K_X$ from samples, i.e., the attackers
have partial knowledge, the curve flattens and even
slightly increases as $M$ becomes large. This is because
the estimation error depends not only on the number
of perturbed copies, but also on the precision of
$\mu_X$ and $K_X$. The estimation based on inaccurately
estimated $\mu_X$ and $K_X$ is not optimal. Consequently,
the estimation accuracy does not always improve as
$M$ increases. Figure 2 also shows that adversaries
having more samples perform better in estimating $\mu_X$
and $K_X$, resulting in improved overall accuracy.

Figures 3(a) and (b) show the corresponding histograms
and cumulative histograms of the estimation
errors for $M = 5$, 10, 20 and 30, using our
proposed scheme and the independent noise scheme.
The cumulative histograms of our scheme approach
1 much more slowly than those of the independent noise
scheme. This indicates that the adversaries obtain less
accurate estimations from copies generated by our
scheme than from those generated by the independent
noise scheme. We also observe that as $M$ increases,
the cumulative histograms of our scheme are almost
identical, as expected, while those of the independent
noise scheme approach the vertical axis, implying that
estimation errors decrease as adversaries obtain more
independently perturbed copies.

30, respectively, using the two different schemes.

In summary, the privacy goal in Section 3.4 is
achieved in this most severe attacking scenario.

We further verify that the perturbed copy by our
scheme has the same utility as that by the independent
noise scheme, if their trust levels are the same.
We use the Iris Plant and Wisconsin Diagnostic Breast
Cancer databases from the UCI Machine Learning
Repository for the experiment. We measure the utilities
with a decision tree classifier and an SVM classifier
with a radial basis kernel. The average accuracies over
10-fold cross validation are reported in Fig. 4. As
seen from Fig. 4, at all noise levels, the accuracies of
the same classifier on the data perturbed by adding
independent noise and by properly adding correlated
noise following our scheme are identical. Therefore,
the perturbed copies at the same trust level by different
noise addition techniques have the same utilities.



6.3 Experiment 2: Scalability Test 

The scalability test is conducted in MATLAB v7.6
on a PC with 2.5GHz CPU and 2GB memory. The
attribute Income is used as the original data. We only
test Algorithm 3, as it offers the maximum flexibility
in generating perturbed copies and it has the
highest time complexity among our three proposed
algorithms. We use the independent noise scheme
with the same settings as a baseline algorithm. Note
that this scheme, although with less runtime, is not
resistant to diversity attacks.

Theorem 8 states that to generate one tuple, the
time complexity is $O(M^3 + N^3)$. To generate $T$ tuples
together, some of the computation can be shared, e.g.,
generating the covariance matrix of $Z''$. As a result,
the total time complexity to generate $T$ perturbed tuples
is $O(M^3 + N^3 + T(M^2 N + MN^2))$, and the average
time complexity for one tuple is $O(M^2 N + MN^2)$ for
large $T$.

Figure 5 shows the runtime of Algorithm 3 as
a function of the total number of perturbed copies
$M$. For each value of $M$, the data owner generates
$M - L$ perturbed copies each of 10^ tuples. We set
$L = \lfloor M/4 \rfloor$, $\lfloor M/2 \rfloor$, and $\lfloor 3M/4 \rfloor$, respectively. Our
observations are threefold. First, our algorithm is fast.
For example, generating 23 perturbed copies ($M = 30$,
$L = \lfloor M/4 \rfloor = 7$) only takes 0.37 seconds. Second,
the runtime of Algorithm 3 that we observe
increases only approximately linearly in $M$. This observed
complexity is much smaller than the theoretical upper
bound $O(M^3 + N^3 + M^2 N + MN^2)$ we estimated
in Section 5.4. Third, the runtime difference between
Algorithm 3 and the independent noise scheme is
small. The time complexity of Algorithm
3 is the same as that of generating jointly Gaussian
noise given the mean and covariance. One of the reasons
why the independent noise scheme is marginally
faster is that it uses an all-zero mean vector and a
diagonal covariance matrix.

Fig. 5. The runtime as a function of the total number
of perturbed copies $M$, when the data owner generates
$M - L$ perturbed copies each of 10'"' tuples. The runtime
is averaged over 100 repeated tests.

7 Conclusion and future work 

In this work, we expand the scope of additive perturbation
based PPDM to multi-level trust (MLT), by
relaxing an implicit assumption of single-level trust
in existing work. MLT-PPDM allows data owners to
generate differently perturbed copies of their data for
different trust levels.

The key challenge lies in preventing the data miners
from combining copies at different trust levels to
jointly reconstruct the original data more accurately
than what is allowed by the data owner.

We address this challenge by properly correlating 
noise across copies at different trust levels. We prove 
that if we design the noise covariance matrix to have 
corner-wave property, then data miners will have 
no diversity gain in their joint reconstruction of the 
original data. We verify our claim and demonstrate 
the effectiveness of our solution through numerical 
evaluation. 

Last but not least, our solution allows data
owners to generate perturbed copies of their data at
arbitrary trust levels on demand. This property offers
the data owner maximum flexibility.

We believe that multi-level trust privacy preserving 
data mining can find many applications. Our work 
takes the initial step to enable MLT-PPDM services. 

Many interesting and important directions are
worth exploring. For example, it is not clear how to
expand the scope of other approaches in the area of
partial information hiding, such as random rotation
based data perturbation, k-anonymity, and retention
replacement, to multi-level trust. It is also of great
interest to extend our approach to handle evolving
data streams.

As with most existing work on perturbation-based PPDM, our work is limited in the sense that it considers only linear attacks. More powerful adversaries may apply nonlinear techniques to derive the original data and recover more information. Studying the MLT-PPDM problem under this adversarial model is an interesting future direction.

Appendix A

Derivation of Equation (4)

Assume that the LLSE estimate is $\hat{X}(Y) = AY + b$, where $A$ and $b$ are parameters. LLSE minimizes the square error between the estimated data $\hat{X}(Y)$ and the original data $X$, i.e.,

$$J = \tfrac{1}{2} E\{\mathrm{Tr}[(X - \hat{X}(Y))(X - \hat{X}(Y))^T]\} = \tfrac{1}{2} E\{\mathrm{Tr}[(X - AY - b)(X - AY - b)^T]\}.$$

As $J$ is a quadratic function of $A$ and $b$, the optimal values of $A$ and $b$ satisfy

$$\frac{\partial J}{\partial A} = -E[(X - \hat{X}(Y))Y^T] = 0, \qquad \frac{\partial J}{\partial b} = -E[X - \hat{X}(Y)] = 0.$$

The above equations are called the orthogonality principle, from which

$$A = K_{XY}K_Y^{-1}, \qquad b = E[X] - K_{XY}K_Y^{-1}E[Y].$$

Thus, we have

$$\hat{X}(Y) = K_{XY}K_Y^{-1}[Y - E[Y]] + E[X],$$

where $E[X] = E[Y] = \mu_X$.
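
The derivation above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a single perturbed copy $Y = X + Z$ with $Z \sim N(0, \sigma^2 K_X)$ independent of $X$, so that $K_{XY} = K_X$ and $K_Y = K_X + K_Z$. The helper name `llse_coefficients`, the covariance `K_X`, and the noise level `sigma2` are all illustrative choices.

```python
import numpy as np

def llse_coefficients(K_XY, K_Y, mean_X, mean_Y):
    """Return (A, b) of the linear estimator X_hat = A @ Y + b."""
    A = K_XY @ np.linalg.inv(K_Y)
    b = mean_X - A @ mean_Y
    return A, b

rng = np.random.default_rng(0)
N = 3
mu_X = np.array([1.0, -2.0, 0.5])
B = rng.standard_normal((N, N))
K_X = B @ B.T + N * np.eye(N)      # an arbitrary positive-definite covariance
sigma2 = 0.5                       # assumed noise level
K_Z = sigma2 * K_X

# For Y = X + Z with Z independent of X: K_XY = K_X and K_Y = K_X + K_Z.
A, b = llse_coefficients(K_X, K_X + K_Z, mu_X, mu_X)

# Empirical check of the two orthogonality conditions.
n = 200_000
X = rng.multivariate_normal(mu_X, K_X, size=n)
Z = rng.multivariate_normal(np.zeros(N), K_Z, size=n)
Y = X + Z
err = X - (Y @ A.T + b)
mean_err = err.mean(axis=0)        # sample estimate of E[X - X_hat]
cross = err.T @ Y / n              # sample estimate of E[(X - X_hat) Y^T]
```

Under these assumptions $A$ reduces to $I/(1+\sigma^2)$, and both sample moments should be close to zero up to Monte Carlo error.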

Appendix B 

Proof of Theorem 1

We first prove the if part of the theorem. From the covariance matrix of $Z_1$ and $Z_2$, we know that $E[Z_1Z_2] = \sigma_{Z_1}^2$. Therefore,

$$E[Z_1(Z_2 - Z_1)] = E[Z_1Z_2] - E[Z_1^2] = \sigma_{Z_1}^2 - \sigma_{Z_1}^2 = 0, \qquad (19)$$

suggesting that $Z_1$ and $Z_2 - Z_1$ are linearly independent (uncorrelated).

Meanwhile, by the definition of jointly Gaussian, $Z_2 - Z_1$ is also a Gaussian random variable. For the jointly Gaussian variables $Z_1$ and $Z_2 - Z_1$, linear independence implies independence.

We now prove the only if part of the theorem. We observe that $Z_2 = Z_1 + (Z_2 - Z_1)$ is the sum of two independent Gaussian random variables. Thus, $Z_2$ and $Z_1$ are jointly Gaussian by definition, and we also have $E[Z_2Z_1] = E[Z_1Z_2] = \sigma_{Z_1}^2$. It follows that their covariance matrix is as follows:

$$\begin{bmatrix} \sigma_{Z_1}^2 & \sigma_{Z_1}^2 \\ \sigma_{Z_1}^2 & \sigma_{Z_2}^2 \end{bmatrix}.$$



Appendix C 

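Proof of Theorem 2

By Theorem 5, if $Z_1$ and $Z_2$ satisfy that $Z_1$ and $Z_2 - Z_1$ are independent, then their covariance matrix, denoted by $K_Z$, must be given by

$$K_Z = \begin{bmatrix} \sigma_{Z_1}^2 K_X & \sigma_{Z_1}^2 K_X \\ \sigma_{Z_1}^2 K_X & \sigma_{Z_2}^2 K_X \end{bmatrix}.$$

Based on $Y_1$, the LLSE estimation of $X$ has an estimation error of

$$\frac{1}{N}\mathrm{Tr}\left(K_X - K_X(K_X + \sigma_{Z_1}^2 K_X)^{-1}K_X\right) = \frac{\sigma_{Z_1}^2}{1 + \sigma_{Z_1}^2}\,\frac{1}{N}\mathrm{Tr}(K_X) = \frac{1}{1 + 1/\sigma_{Z_1}^2}\,\frac{1}{N}\mathrm{Tr}(K_X), \qquad (20)$$

which can be computed using Equation (8).

Similarly, based on both $Y_1$ and $Y_2$, the LLSE estimation of $X$ has an estimation error of

$$\frac{1}{N}\mathrm{Tr}\left(K_X - K_XH^T(HK_XH^T + K_Z)^{-1}HK_X\right), \qquad H = [I_N, I_N]^T.$$

After simplification, the above estimation error is exactly the one shown in Equation (20). Thus, Equation (11) holds.

Appendix D

Proof of Theorem 3

If $Z_i$, $1 \le i \le M$, are independent of each other, then $K_Z$ is given by

$$K_Z = \begin{bmatrix} \sigma_{Z_1}^2 K_X & & \\ & \ddots & \\ & & \sigma_{Z_M}^2 K_X \end{bmatrix}.$$

By Equation (7), the estimation errors are the diagonal terms of the following matrix:

$$K_X - K_XH^T(HK_XH^T + K_Z)^{-1}HK_X = \left(1 + \sum_{i=1}^{M}\frac{1}{\sigma_{Z_i}^2}\right)^{-1}K_X.$$

Appendix E

Proof of Theorem 4

By the definition of distortion and the result shown in Equation (7), we have

$$\mathcal{D}(X, \hat{X}(Y)) = \frac{1}{N}\mathrm{Tr}\left((K_X^{-1} + H^TK_Z^{-1}H)^{-1}\right),$$

and for $i = 1, \dots, M$,

$$\mathcal{D}(X, \hat{X}(Y_i)) = \frac{1}{N}\mathrm{Tr}\left((K_X^{-1} + K_{Z_i}^{-1})^{-1}\right) = \frac{1}{1 + 1/\sigma_{Z_i}^2}\,\frac{1}{N}\mathrm{Tr}(K_X).$$

Two observations can be made for the above two equations. First, we must have $\mathcal{D}(X, \hat{X}(Y_i)) < \mathcal{D}(X, \hat{X}(Y_{i+1}))$ due to the assumption on $\sigma_{Z_i}^2$, and

$$\min_{i=1,\dots,M}\mathcal{D}(X, \hat{X}(Y_i)) = \mathcal{D}(X, \hat{X}(Y_1)) = \frac{1}{N}\mathrm{Tr}\left((K_X^{-1} + K_{Z_1}^{-1})^{-1}\right).$$

Second, the proof is complete if we can show that

$$H^TK_Z^{-1}H = K_{Z_1}^{-1}. \qquad (21)$$

This obviously holds for the case of $M = 1$.

For $M \ge 2$, rewrite $K_Z$ in the following form:

$$K_Z = \begin{bmatrix} K_{Z_1} & K_{Z_1} & \cdots & K_{Z_1} \\ K_{Z_1} & K_{Z_2} & \cdots & K_{Z_2} \\ \vdots & \vdots & \ddots & \vdots \\ K_{Z_1} & K_{Z_2} & \cdots & K_{Z_M} \end{bmatrix}, \qquad K_{Z_i} = \sigma_{Z_i}^2 K_X.$$

We find its inverse following a standard process. We perform row operations on the matrix $[K_Z \mid I]$ until it has the form $[I \mid A]$. Then matrix $A$ is $K_Z^{-1}$. Note that the structure of $K_Z$ makes this process straightforward.

Following the above process, we find the expression of $K_Z^{-1}$ for the case of $M \ge 2$ as follows:

$$K_Z^{-1} = \begin{bmatrix}
\left(\frac{1}{\sigma_{Z_1}^2} + C_1\right)K_X^{-1} & -C_1K_X^{-1} & & & \\
-C_1K_X^{-1} & (C_1 + C_2)K_X^{-1} & -C_2K_X^{-1} & & \\
 & -C_2K_X^{-1} & (C_2 + C_3)K_X^{-1} & \ddots & \\
 & & \ddots & \ddots & -C_{M-1}K_X^{-1} \\
 & & & -C_{M-1}K_X^{-1} & C_{M-1}K_X^{-1}
\end{bmatrix}, \qquad (22)$$

where

$$C_i = \frac{1}{\sigma_{Z_{i+1}}^2 - \sigma_{Z_i}^2}, \qquad 1 \le i \le M - 1.$$

It is straightforward to verify that the product of $K_Z$ and the above matrix is an identity matrix.

Noticing that $K_Z^{-1}$ only has non-zero entries in the main diagonal and the two adjacent diagonals, and that its column and row sums are zero except for the first row and column, we have

$$H^TK_Z^{-1}H = \frac{1}{\sigma_{Z_1}^2}K_X^{-1} = K_{Z_1}^{-1},$$

and the proof is complete.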
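
The contrast between correlated and independent noise can be illustrated with a small numerical sketch. It is an illustration under stated assumptions rather than the authors' implementation: the noise levels `sigma2` and covariance `K_X` are arbitrary, the helper names `corner_wave` and `joint_error_cov` are hypothetical, and the error covariance is computed with the standard LLSE identity $(K_X^{-1} + H^TK_Z^{-1}H)^{-1}$ for $Y = HX + Z$.

```python
import numpy as np

def corner_wave(sigma2):
    """Sigma_Z whose (i, j) entry equals sigma2[min(i, j)]."""
    idx = np.arange(len(sigma2))
    return sigma2[np.minimum.outer(idx, idx)]

def joint_error_cov(K_X, K_Z, M):
    """LLSE error covariance (K_X^{-1} + H^T K_Z^{-1} H)^{-1} for Y = H X + Z."""
    N = K_X.shape[0]
    H = np.kron(np.ones((M, 1)), np.eye(N))   # M stacked identity blocks
    return np.linalg.inv(np.linalg.inv(K_X) + H.T @ np.linalg.inv(K_Z) @ H)

sigma2 = np.array([0.5, 1.0, 2.5])            # assumed increasing noise levels
K_X = np.array([[2.0, 0.5], [0.5, 1.0]])      # assumed data covariance
M = len(sigma2)
Sigma_Z = corner_wave(sigma2)

# Tridiagonal inverse of Sigma_Z built from C_i = 1 / (sigma2[i+1] - sigma2[i]).
C = 1.0 / np.diff(sigma2)
T = np.zeros((M, M))
T[0, 0] = 1.0 / sigma2[0]
for i in range(M - 1):
    T[i, i] += C[i]
    T[i + 1, i + 1] += C[i]
    T[i, i + 1] -= C[i]
    T[i + 1, i] -= C[i]

err_single = joint_error_cov(K_X, sigma2[0] * K_X, 1)                # Y_1 only
err_corr = joint_error_cov(K_X, np.kron(Sigma_Z, K_X), M)            # corner-wave noise
err_ind = joint_error_cov(K_X, np.kron(np.diag(sigma2), K_X), M)     # independent noise
```

With corner-wave noise the joint error covariance coincides with the one obtained from $Y_1$ alone, while independent noise yields a strictly smaller trace, which is the diversity gain the construction is meant to prevent.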

Appendix F 

Proof of Theorem 5

We first prove the if part of the theorem. Since $Z_1$ to $Z_M$ are jointly Gaussian variables, $Z_1$ and $(Z_i - Z_{i-1})$ for $i = 2, \dots, M$ are also jointly Gaussian variables. This is because any linear combination of them is simply another linear combination of $Z_1$ to $Z_M$, and is thus Gaussian. Jointly Gaussian variables are mutually independent if their covariance matrix is a diagonal matrix. This can be easily verified by evaluating their joint distribution.

From the covariance matrix of $Z$, we know that for $j \ge i$, $E[Z_iZ_j^T] = K_{Z_i}$. For $2 \le i < j \le M$, we have

$$E[(Z_i - Z_{i-1})(Z_j - Z_{j-1})^T] = E[Z_iZ_j^T] - E[Z_iZ_{j-1}^T] - E[Z_{i-1}Z_j^T] + E[Z_{i-1}Z_{j-1}^T] = K_{Z_i} - K_{Z_i} - K_{Z_{i-1}} + K_{Z_{i-1}} = 0.$$

We also have, for $2 \le j \le M$,

$$E[Z_1(Z_j - Z_{j-1})^T] = E[Z_1Z_j^T] - E[Z_1Z_{j-1}^T] = K_{Z_1} - K_{Z_1} = 0.$$

As such, the covariance matrix of $Z_1$ and $(Z_i - Z_{i-1})$ for $i = 2, \dots, M$ must be diagonal, and they are mutually independent.

We now prove the only if part of the theorem. Since $Z_1$ and $(Z_i - Z_{i-1})$ for $i$ from 2 to $M$ are mutually independent Gaussian variables, $Z_1$ to $Z_M$ must be jointly Gaussian. This is because each of them is simply a linear combination of independent Gaussian variables.

We also have, for $j > i$,

$$E[Z_iZ_j^T] = E\left[Z_i\left(Z_i + \sum_{l=i+1}^{j}(Z_l - Z_{l-1})\right)^T\right] = E[Z_iZ_i^T] + \sum_{l=i+1}^{j}E[Z_i(Z_l - Z_{l-1})^T] = K_{Z_i}.$$

It follows that $K_Z$ must have the form as in Equation (15).
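
Both directions of the argument can be checked with exact covariance algebra. The sketch below is illustrative only: it assumes $K_{Z_i} = \sigma_{Z_i}^2 K_X$ with arbitrary example values for `sigma2` and `K_X`, and it encodes the differencing and summing maps as Kronecker products.

```python
import numpy as np

sigma2 = np.array([0.2, 0.7, 1.5, 4.0])      # assumed increasing noise levels
K_X = np.array([[1.0, 0.3], [0.3, 2.0]])     # assumed data covariance
M, N = len(sigma2), K_X.shape[0]

# Corner-wave covariance: block (i, j) of K_Z is sigma2[min(i, j)] * K_X.
idx = np.arange(M)
Sigma_Z = sigma2[np.minimum.outer(idx, idx)]
K_Z = np.kron(Sigma_Z, K_X)

# "If" direction: map (Z_1, ..., Z_M) to (Z_1, Z_2 - Z_1, ..., Z_M - Z_{M-1}).
D = np.eye(M) - np.eye(M, k=-1)
T = np.kron(D, np.eye(N))
K_inc = T @ K_Z @ T.T                        # covariance of the increments

# Expected block-diagonal covariance of mutually independent increments.
increments = np.diff(sigma2, prepend=0.0)
K_inc_expected = np.kron(np.diag(increments), K_X)

# "Only if" direction: summing independent increments recovers K_Z.
U = np.tril(np.ones((M, M)))
S = np.kron(U, np.eye(N))
K_back = S @ K_inc_expected @ S.T
```

Here `K_inc` should be block diagonal with blocks $(\sigma_{Z_i}^2 - \sigma_{Z_{i-1}}^2)K_X$, and `K_back` should reproduce the corner-wave matrix.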

Appendix G 

Time and Space Complexity 

For ease of discussion, we summarize the time complexity of several basic operations as follows:

• Multiplication of two matrices: the complexity of multiplying a $P_1 \times P_2$ matrix and a $P_2 \times P_3$ matrix is $O(P_1P_2P_3)$ by direct computation.

• Cholesky decomposition of $P \times P$ matrices: the time complexity is $O(P^3)$ [43, pp. 245].

G.1 Proof of Lemma 1

To generate a jointly Gaussian random vector $G$, the standard routine generates an independent zero-mean unit-variance Gaussian vector $N$ and then uses a linear transformation

$$G = \mu_G + L_GN, \qquad (23)$$

where $K_G = L_GL_G^T$ is the Cholesky decomposition [43, pp. 245] of $K_G$.

If both $\Sigma_G$ and $K_0$ are positive semi-definite, we can perform the Cholesky decomposition as $\Sigma_G = L_\Sigma L_\Sigma^T$ and $K_0 = L_0L_0^T$, and then

$$K_G = \Sigma_G \otimes K_0 = (L_\Sigma L_\Sigma^T) \otimes (L_0L_0^T) = (L_\Sigma \otimes L_0)(L_\Sigma \otimes L_0)^T.$$

Thus the Cholesky decomposition of $K_G$ can be expressed as $L_G = L_\Sigma \otimes L_0$.

Following that, Equation (23) can be written as

$$G = \mu_G + (L_\Sigma \otimes L_0)N = \mu_G + \mathrm{vec}(L_0N_{Q \times P}L_\Sigma^T), \qquad (24)$$

where $N_{Q \times P}$ is a $Q \times P$ matrix satisfying $\mathrm{vec}(N_{Q \times P}) = N$.

The total time complexity is the sum of the complexity of generating $PQ$ independent zero-mean unit-variance Gaussian random variables (see footnote 3), $O(PQ)$; the Cholesky decomposition, $O(P^3 + Q^3)$; the matrix multiplication, $O(Q^2P + P^2Q)$; and the vector addition, $O(PQ)$ (see footnote 4); i.e., $O(P^3 + Q^3)$.

To complete the proof, it remains to show that $\Sigma_G$ is positive semi-definite given that $K_G$ and $K_0$ are covariance matrices and $K_G = \Sigma_G \otimes K_0$.

The definition of positive semi-definite matrices suggests that $F^TK_GF \ge 0$ for an arbitrary column vector $F$. Without loss of generality, we assume that the element $K_0(1,1)$ in the top-left corner of $K_0$ is positive. Then we let $F_1$ be composed of all the $(iQ + 1)$-th ($i = 0, \dots, P - 1$) elements of $F$ and let the other elements of $F$ be zero, and thus $F_1^T(K_0(1,1)\Sigma_G)F_1 = F^TK_GF \ge 0$. It follows that $\Sigma_G$ is positive semi-definite, as $F_1^T\Sigma_GF_1 \ge 0$ for any $F_1$.
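
The identity behind Equation (24) can be sketched as follows. This is a minimal illustration, not the routine cited in the paper: the function name `sample_kron_gaussian` and the example matrices are assumptions, `vec` is taken to be column-major stacking, and `np.linalg.cholesky` requires strictly positive-definite inputs (a slightly stronger condition than the positive semi-definiteness discussed above).

```python
import numpy as np

def sample_kron_gaussian(mu, Sigma_G, K_0, rng):
    """Draw G ~ N(mu, Sigma_G (x) K_0) without forming the PQ x PQ factor."""
    L_Sigma = np.linalg.cholesky(Sigma_G)        # P x P, cost O(P^3)
    L_0 = np.linalg.cholesky(K_0)                # Q x Q, cost O(Q^3)
    P, Q = Sigma_G.shape[0], K_0.shape[0]
    N_mat = rng.standard_normal((Q, P))          # vec(N_mat) plays the role of N
    # vec(L_0 N L_Sigma^T) = (L_Sigma (x) L_0) vec(N), with column-major vec.
    G = mu + (L_0 @ N_mat @ L_Sigma.T).flatten(order="F")
    return G, N_mat

rng = np.random.default_rng(1)
Sigma_G = np.array([[1.0, 1.0], [1.0, 3.0]])
K_0 = np.array([[2.0, 0.4, 0.0], [0.4, 1.0, 0.2], [0.0, 0.2, 1.5]])
mu = np.zeros(6)
G, N_mat = sample_kron_gaussian(mu, Sigma_G, K_0, rng)

# Reference computation with the explicit Kronecker factor L_G.
L_G = np.kron(np.linalg.cholesky(Sigma_G), np.linalg.cholesky(K_0))
G_direct = mu + L_G @ N_mat.flatten(order="F")
```

The two small factorizations and the two small matrix products replace a single $O(P^3Q^3)$ factorization of $K_G$, which is where the $O(P^3 + Q^3)$ bound comes from.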

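G.2 Proof of Theorem 6

With the technique in the proof of Lemma 1, generating the $M$ $N$-dimensional jointly Gaussian noise vector $Z$ can use Equation (24).

In Algorithm 1, the covariance matrix of $Z$ is $K_Z = \Sigma_Z \otimes K_X$, where

$$\Sigma_Z = \begin{bmatrix} \sigma_{Z_1}^2 & \sigma_{Z_1}^2 & \cdots & \sigma_{Z_1}^2 \\ \sigma_{Z_1}^2 & \sigma_{Z_2}^2 & \cdots & \sigma_{Z_2}^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{Z_1}^2 & \sigma_{Z_2}^2 & \cdots & \sigma_{Z_M}^2 \end{bmatrix}.$$

It is easy to verify that for the Cholesky decomposition $\Sigma_Z = L_\Sigma L_\Sigma^T$,

$$L_\Sigma = U\,\mathrm{diag}(V_\sigma),$$

where $U$ is the lower triangular part (including the diagonal) of the all-one $M \times M$ matrix, $V_\sigma = \left[\sigma_{Z_1}, \sqrt{\sigma_{Z_2}^2 - \sigma_{Z_1}^2}, \dots, \sqrt{\sigma_{Z_M}^2 - \sigma_{Z_{M-1}}^2}\right]$, and $\mathrm{diag}(V_\sigma)$ is a diagonal matrix with the vector $V_\sigma$ as its diagonal. Thus Equation (24) can be written as

$$Z = \mu_Z + \mathrm{vec}\left(L_X\left((G_{N \times M}\,\mathrm{diag}(V_\sigma))\,U^T\right)\right). \qquad (25)$$

The matrix multiplication in Equation (25) can be split into three steps, as shown with the brackets, with a time complexity of $O(MN)$, $O(MN)$, and $O(N^2M)$, respectively.

As a result, Algorithm 1 has a time complexity of $O(N^3 + MN^2)$.

The space complexity is $O(M + N^2)$, as Algorithm 1 has to store the noise levels $\sigma_{Z_1}^2, \dots, \sigma_{Z_M}^2$ and the covariance matrix $K_X$.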
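
The closed-form factor and the three-step product in Equation (25) can be sketched as below. This is an illustrative sketch rather than Algorithm 1 itself: the values of `sigma2` and `K_X` are assumed, the mean $\mu_Z$ is taken to be zero, and multiplication by $U^T$ is implemented as a running sum across columns.

```python
import numpy as np

sigma2 = np.array([0.5, 1.0, 2.5])           # assumed increasing noise levels
K_X = np.array([[2.0, 0.5], [0.5, 1.0]])     # assumed data covariance
M, N = len(sigma2), K_X.shape[0]

idx = np.arange(M)
Sigma_Z = sigma2[np.minimum.outer(idx, idx)]

# Closed-form Cholesky factor L_Sigma = U diag(V_sigma).
U = np.tril(np.ones((M, M)))
V_sigma = np.sqrt(np.diff(sigma2, prepend=0.0))
L_Sigma = U @ np.diag(V_sigma)

rng = np.random.default_rng(2)
L_X = np.linalg.cholesky(K_X)                # the only O(N^3) step
G = rng.standard_normal((N, M))              # i.i.d. N(0, 1) entries

# Step 1: scale column j by V_sigma[j]             -> O(MN)
# Step 2: multiply by U^T, i.e. running column sum -> O(MN)
# Step 3: left-multiply by L_X                     -> O(N^2 M)
Z_cols = L_X @ np.cumsum(G * V_sigma, axis=1)
Z = Z_cols.flatten(order="F")                # column j of Z_cols is the noise Z_{j+1}

# Reference: explicit Kronecker factor applied to vec(G).
Z_ref = np.kron(L_Sigma, L_X) @ G.flatten(order="F")
```

Because the factor of $\Sigma_Z$ is available in closed form, no $M \times M$ decomposition is needed, which is consistent with the $O(N^3 + MN^2)$ bound stated above.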

3. $PQ$ independent zero-mean unit-variance Gaussian random variables can be generated using a standard algorithm, e.g., [45].

4. Since the time complexity of generating $PQ$ independent zero-mean unit-variance Gaussian random variables and that of the vector addition are both $O(PQ)$, which can be bounded by $O(P^2Q + PQ^2)$, we omit them in the proofs of Theorems 6 to 8.

G.3 Proof of Theorem 7

According to Lemma 1, when generating one perturbed copy of $X$, the Cholesky decomposition of $K_X$ has a time complexity of $O(N^3)$, and the remaining steps cost $O(N^2)$. As the Cholesky decomposition of $K_X$ can be reused for different copies, generating $M$ perturbed copies of $X$ only has a time complexity of $O(N^3 + MN^2)$.

Algorithm 2 only requires a memory of size $O(N^2)$ for the covariance matrix $K_X$, as the noise levels $\sigma_{Z_i}^2$ ($i = 1, \dots, M$) are input sequentially.

G.4 Proof of Theorem 8

Algorithm 3 first constructs the $(MN) \times (MN)$ matrix $K_Z$ in $O(MN)$ time. It then computes a mean and variance according to Equations (17) and (18).

Note that Equations (17) and (18) can be written as $[(\Sigma_{Z''Z'}\Sigma_{Z'}^{-1}) \otimes I_N]v_1$ and $(\Sigma_{Z''} - \Sigma_{Z''Z'}\Sigma_{Z'}^{-1}\Sigma_{Z''Z'}^T) \otimes K_X$, respectively, where $K_{Z''} = \Sigma_{Z''} \otimes K_X$, $K_{Z'} = \Sigma_{Z'} \otimes K_X$ and $K_{Z''Z'} = \Sigma_{Z''Z'} \otimes K_X$. $\Sigma_{Z'}^{-1}$ has been given in an explicit form in Equation (22). So the time complexities of computing the mean and the Kronecker product form of the covariance matrix are $O((M - L)L^2 + (M - L)LN)$ and $O((M - L)L^2 + (M - L)^2L)$, respectively.

At the end, Algorithm 3 generates $(M - L)$ jointly Gaussian variables with the computed mean and covariance matrix, and outputs $(M - L)$ perturbed copies. According to Lemma 1, the time complexity is $O((M - L)^3 + N^3 + (M - L)^2N + (M - L)N^2)$.

For any value of $L$, the time complexity of Algorithm 3 is bounded by $O(M^3 + N^3 + M^2N + MN^2)$, which can be further simplified to $O(M^3 + N^3)$.

Algorithm 3 requires $O(M^2 + N^2)$ memory to store the covariance matrix $K_Z$.
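
The Kronecker forms used in this proof can be checked on a small instance. The sketch below assumes that Equations (17) and (18) are the usual conditional mean and covariance of jointly Gaussian noise, with $Z'$ denoting the already released copies and $Z''$ the new ones; the index sets `old` and `new`, the helper `block_rows`, and the example values are illustrative and not taken from Algorithm 3.

```python
import numpy as np

def block_rows(levels, N):
    """Indices in K_Z of the N-dimensional blocks for the given trust levels."""
    return np.concatenate([np.arange(l * N, (l + 1) * N) for l in levels])

sigma2 = np.array([0.5, 1.0, 2.5, 4.0])      # assumed increasing noise levels
K_X = np.array([[2.0, 0.5], [0.5, 1.0]])     # assumed data covariance
M, N = len(sigma2), K_X.shape[0]
idx = np.arange(M)
Sigma_Z = sigma2[np.minimum.outer(idx, idx)]
K_Z = np.kron(Sigma_Z, K_X)

old, new = [0, 2], [1, 3]                    # released (Z') and requested (Z'') levels

# Small M-sized computation on Sigma_Z only.
S_oo = Sigma_Z[np.ix_(old, old)]
S_no = Sigma_Z[np.ix_(new, old)]
S_nn = Sigma_Z[np.ix_(new, new)]
coef_small = S_no @ np.linalg.inv(S_oo)
cov_small = S_nn - coef_small @ S_no.T

# Direct computation on the full (MN) x (MN) covariance, for comparison.
o, n = block_rows(old, N), block_rows(new, N)
K_oo, K_no, K_nn = K_Z[np.ix_(o, o)], K_Z[np.ix_(n, o)], K_Z[np.ix_(n, n)]
coef_full = K_no @ np.linalg.inv(K_oo)
cov_full = K_nn - coef_full @ K_no.T
```

The conditional mean coefficient equals `coef_small` Kronecker $I_N$ and the conditional covariance equals `cov_small` Kronecker $K_X$, so only matrices of size $M$ and $N$ need to be handled separately.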



